Hierarchical Semi-Markov Conditional Random 
Fields for Recursive Sequential Data 

Tran The Truyen †, Dinh Q. Phung †, Hung H. Bui ‡, and Svetha Venkatesh †
† Department of Computing, Curtin University of Technology
GPO Box U1987 Perth, WA 6845, Australia

thetruyen.tran@postgrad.curtin.edu.au
{D.Phung, S.Venkatesh}@curtin.edu.au

‡ Artificial Intelligence Center, SRI International
333 Ravenswood Ave, Menlo Park, CA 94025, USA

bui@ai.sri.com

September 13, 2010 



Abstract 

Inspired by the hierarchical hidden Markov models (HHMM), we present the 
hierarchical semi-Markov conditional random field (HSCRF), a generalisation of 
embedded undirected Markov chains to model complex hierarchical, nested Markov 
processes. It is parameterised in a discriminative framework and has polynomial 
time algorithms for learning and inference. Importantly, we consider partially- 
supervised learning and propose algorithms for generalised partially-supervised 
learning and constrained inference. We demonstrate the HSCRF in two applica- 
tions: (i) recognising human activities of daily living (ADLs) from indoor surveil- 
lance cameras, and (ii) noun-phrase chunking. We show that the HSCRF is capa- 
ble of learning rich hierarchical models with reasonable accuracy in both fully and 
partially observed data cases.

1 Introduction

Modelling hierarchical aspects in complex stochastic processes is an important research
issue in many application domains. In a hierarchy, each level is an abstraction of
lower level details. For example, a frequent human activity such as 'eat-breakfast'
may include a series of more specific activities like 'enter-kitchen', 'go-to-cupboard',
'take-cereal', 'wash-dishes' and 'leave-kitchen'. Each specific activity can be
decomposed into finer details. Similarly, in natural language processing (NLP)
syntax trees are inherently hierarchical. In a partial parsing task known as noun-phrase
(NP) chunking (Sang and Buchholz, 2000) there are three semantic levels: the sentence,
noun-phrases and part-of-speech (POS) tags. In this setting, the sentence is a
sequence of NPs and non-NPs and each phrase is a sub-sequence of POS tags.

A popular approach to deal with hierarchical data is to build a cascaded model
where each level is modelled separately, and the output of the lower level is used as
the input of the level right above it (e.g. see (Oliver et al., 2004)). For instance, in NP
chunking this approach first builds a POS tagger and then constructs a chunker that
incorporates the output of the tagger. This approach is clearly sub-optimal because
the POS tagger uses no information about the NPs and the chunker is not aware of the
reasoning of the tagger. Yet a noun-phrase is often very informative for inferring the
POS tags belonging to the phrase. As a result, this layered approach often suffers from
the so-called cascading error problem, as the error introduced at the lower layer will
propagate to higher levels.

A more holistic approach is to build a joint representation of all the levels. For- 
mally, we are given a data observation z and we need to model and infer about the 
joint semantic x. The main problem is to choose an appropriate representation of x 
so that inference can be efficient. In this paper, we are interested in a specific class of 
hierarchical models that supports both joint modelling and efficient inference. More 
specifically, the models of interest are recursive and sequential, in that each level is a 
sequence and each node in a sequence can be decomposed further into a sub-sequence 
of finer grain at the lower level. 

There has been substantial investigation of these types of model, especially in the
area of probabilistic context-free grammars (e.g. see (Manning and Schütze, 1999,
Chapter 11)). However, grammars are often unbounded in depth and thus difficult
to represent by graphical models. A more restricted version known as the hierarchical
hidden Markov model (HHMM) (Fine et al., 1998) offers clearer representation in that
the depth is fixed and the semantic levels are well defined. Essentially, the HHMM is
a nested hidden Markov network (HMM) in the sense that each state is a sub HMM by
itself.

These models share a common property in that they are generative, i.e. they assume
that the data observation is generated by the hierarchical semantics. The generative
models try to construct the joint distribution $\Pr(x, z) = \Pr(z|x) \Pr(x)$.
However, there are some drawbacks associated with this approach. First, the generative
process modelled by $\Pr(z|x)$ is typically unknown and complicated. Second,
given an observation $z$, we are more often interested in inferring $\Pr(x|z)$. Since
$\Pr(x, z) = \Pr(x|z) \Pr(z)$, modelling $\Pr(z)$ may be unnecessary.

An attractive alternative is to model the distribution $\Pr(x|z)$ directly, avoiding the
modelling of $z$. This line of research has recently attracted much interest, largely
triggered by the introduction of the conditional random field (CRF) (Lafferty et al., 2001).
The advantages of the CRF are largely attributed to its discriminative nature, which allows
arbitrary and long-range interdependent features.

In this paper we follow the HMM/HHMM path to generalise from chain-structured 
CRFs to nested CRFs. As a result, we construct a novel model called Hierarchical 
Semi-Markov Conditional Random Field (HSCRF), which is an undirected conditional 
graphical model of nested Markov chains. Thus HSCRF is the combination of the 
discriminative nature of CRFs and the nested modelling of the HHMM. 

To be more concrete let us return to the NP chunking example. The problem can 
be modelled as a three-level HSCRF, where the root represents the sentence, the sec- 
ond level the NP process, and the bottom level the POS process. The root and the two 
processes are conditioned on the sequence of words in the sentence. Under the dis- 
criminative modelling of the HSCRF, rich contextual information such as starting and 
ending of the phrase, the phrase length, and the distribution of words falling inside the 
phrase can be effectively encoded. On the other hand, such encoding is much more 
difficult for HHMMs. 

We then proceed to address important issues. First, we show how to represent
HSCRFs using a dynamic graphical model (e.g. see (Lauritzen, 1996)) which effectively
encodes hierarchical and temporal semantics. For parameter learning, an efficient
algorithm based on the Asymmetric Inside-Outside of (Bui et al., 2004) is introduced.
For inference, we generalise the Viterbi algorithm to decode the semantics from an
observational sequence.

The common assumptions in discriminative learning and inference are that the 
training data in learning is fully labelled, and the test data during inference is not 
labelled. We propose to relax these assumptions in that training labels may only be 
partially available, and we term the learning as partial-supervision. Likewise, when 
some labels are given during inference, the algorithm should automatically adjust to 
meet the new constraints. 

We demonstrate the effectiveness of HSCRFs in two applications: (i) segmenting
and labelling activities of daily living (ADLs) in an indoor environment and (ii) jointly
modelling noun-phrases and part-of-speech tags in shallow parsing. Our experimental
results in the first application show that the HSCRFs are capable of learning rich,
hierarchical activities with good accuracy and exhibit better performance when compared to
DCRFs and flat-CRFs. Results for the partially observable case also demonstrate that
a significant reduction of training labels still results in models that perform reasonably
well. We also show that observing a small amount of labels can significantly increase
the accuracy during decoding. In shallow parsing, the HSCRFs can achieve higher
accuracy than standard CRF-based techniques and the recent DCRFs.
To summarise, in this paper we claim the following contributions: 

• Introducing a novel Hierarchical Semi-Markov Conditional Random Field (HSCRF)
to model complex hierarchical and nested Markovian processes in a discriminative
framework.

• Developing an efficient generalised Asymmetric Inside-Outside (AIO) algorithm
for fully-supervised learning.

• Generalising the Viterbi algorithm for decoding the most probable semantic labels
and structure given an observational sequence.

• Addressing the problem of partially-supervised learning and constrained inference.

• Demonstrating the applicability of the HSCRFs for modelling human activities
in the domain of home video surveillance and shallow parsing of English.

Notations and Organisation 

This paper makes use of a number of mathematical notations which we include in
Table 1 for reference.

The rest of the paper is organised as follows. Section 2 reviews Conditional Random
Fields. Section 3 continues with the HSCRF model definition and parameterisation.
Section 4 defines building blocks required for common inference tasks. These
blocks are computed in Sections 4.2 and 4.3. Section 5 presents the generalised Viterbi
algorithm. Parameter estimation follows in Section 6. Learning and inference with
partially available labels are addressed in Section 7. Section 8 presents a method for
numerical scaling to prevent numerical overflow. Section 9 documents experimental
results. Section 10 concludes the paper.



2 Related Work 

2.1 Hierarchical Modelling of Stochastic Processes 

Hierarchical modelling of stochastic processes can be largely categorised as either
graphical models extending the flat hidden Markov models (HMM) (e.g., the layered
HMM (Oliver et al., 2004), the abstract HMM (Bui et al., 2002), hierarchical HMM
(HHMM) (Fine et al., 1998; Bui et al., 2004), DBN (Murphy, 2002)) or grammar-based
models (e.g. PCFG (Pereira and Schabes, 1992)). These models are all generative.

Recent developments in discriminative, hierarchical structures include extensions of
the flat CRFs (e.g. dynamic CRFs (DCRF) (Sutton et al., 2007), hierarchical CRFs
(Liao et al., 2007; Kumar and Hebert, 2005)) and conditional learning of the grammars
(e.g. see (Miyao and Tsujii, 2002; Clark and Curran, 2003)). The main problem of the DCRFs
is that they are not scalable due to inference intractability. The hierarchical CRFs,
on the other hand, are tractable but assume fixed tree structures, and therefore are not
flexible enough to adapt to complex data. For example, in the noun-phrase chunking problem
no prior tree structures are known. Rather, if such a structure exists, it can only be
discovered after the model has been successfully built and learned.

The conditional probabilistic context-free grammar (C-PCFG) appears to address 
both tractability and dynamic structure issues. More precisely, in C-PCFGs it takes 
cubic time in sequence length to parse a sentence. However, the context-free grammar 
does not limit the depth of semantic hierarchy, thus making it unnecessarily difficult 
to map many hierarchical problems into its form. Secondly, it lacks a graphical model 
representation and thus does not enjoy the rich set of approximate inference techniques 
available in graphical models. 



2.2 Hierarchical Hidden Markov Models 



Hierarchical HMMs are generalisations of HMMs (Rabiner, 1989) in the way that a
state in an HHMM may be a sub-HHMM. Thus, an HHMM is a nested Markov chain.
In the model temporal evolution, when a child Markov chain terminates, it returns the
control to its parent. Nothing from the terminated child chain is carried forward. Thus,
the parent state abstracts out everything belonging to it. Upon receiving the return
control the parent then either transits to a new parent (given that the grandparent has
not finished), or terminates.

Figure 1 illustrates the state transition diagram of a two-level HHMM. At the top
level there are two parent states {A, B}. The parent A has three children, i.e. ch(A) =
{1, 2, 3} and B has four, i.e. ch(B) = {4, 5, 6, 7}. At the top level the transitions are
between A and B, as in a normal directed Markov chain. Under each parent there are
also transitions between child states, which only depend on the direct parent (either A
or B). There are special ending states (represented as shaded nodes in Figure 1) to
signify the termination of the Markov chains. At each time step of the child Markov
chain, a child will emit an observational symbol (not shown here).

The temporal evolution of the HHMM can be represented as a dynamic Bayesian
network, which was first done in (Murphy and Paskin, 2002). Figure 2 depicts a DBN
structure of 3 levels. The bottom level is often referred to as the production level.
Associated with each state is an ending indicator to signify the termination of the state.
Denote by $x_t^d$ and $e_t^d$ the state and ending indicator at level $d$ and time $t$,
respectively. When $e_t^d = 0$, the state $x_t^d$ continues, i.e. $x_{t+1}^d = x_t^d$.
And when $e_t^d = 1$, the state $x_t^d$ transits to a new state, or transits to itself.
There are hierarchical consistency rules that must be ensured. Whenever a state persists
(i.e. $e_t^d = 0$), all of the states above it must also persist (i.e. $e_t^{d'} = 0$ for
all $d' < d$). Similarly, whenever a state ends (i.e. $e_t^d = 1$), all of the states below
it must also end (i.e. $e_t^{d'} = 1$ for all $d' > d$).
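The two consistency rules can be checked mechanically on a single time slice. The following is a minimal sketch, not part of the original model code; the function name `is_consistent` and the list encoding (top level first) are assumptions made for illustration.

```python
def is_consistent(e):
    """Check the two HHMM consistency rules on one time slice.

    `e` lists the ending indicators from the top level (index 0)
    down to the bottom level; each entry is 0 (persist) or 1 (end).
    """
    for d, flag in enumerate(e):
        # a state persists although some state above it has ended
        if flag == 0 and any(e[:d]):
            return False
        # a state ends although some state below it still persists
        if flag == 1 and not all(e[d + 1:]):
            return False
    return True


print(is_consistent([0, 0, 1]))  # top two levels persist, bottom level ends
print(is_consistent([0, 1, 0]))  # middle level ends while its child persists
```

A valid slice is therefore always a run of zeros followed by a run of ones when read from the top level to the bottom level.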

Figure 1: The state transition diagram of an HHMM.

Inference and learning in HHMMs follow the Inside-Outside algorithm of the
probabilistic context-free grammars. Overall, the algorithm has $O(|S|^3 D T^3)$ time
complexity where $|S|$ is the maximum size of the state space at each level, $D$ is the depth
of the model and $T$ is the model length.

When represented as a DBN, the whole stack of states $x_t^{1:D}$ can be collapsed into
a 'mega-state' of a big HMM, and therefore inference can be carried out in $O(|S|^{2D} T)$
time. This is efficient for a shallow model (i.e. $D$ is small), but problematic for a deep
model (i.e. $D$ is large).
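To make the trade-off concrete, the sketch below evaluates the two leading-order terms for a few depths. Constants are ignored, the function names are hypothetical, and the values $|S| = 5$ and $T = 100$ are arbitrary choices for illustration only.

```python
def inside_outside_cost(S, D, T):
    """Leading-order term |S|^3 * D * T^3 (constants ignored)."""
    return S**3 * D * T**3


def collapsed_hmm_cost(S, D, T):
    """Leading-order term |S|^(2D) * T (constants ignored)."""
    return S**(2 * D) * T


# compare the two terms as the depth grows, for |S| = 5 and T = 100
for D in (2, 3, 5):
    print(D, inside_outside_cost(5, D, 100), collapsed_hmm_cost(5, D, 100))
```

With these particular values the collapsed representation has the smaller term at depth 2 and 3, and the larger one at depth 5, which is the behaviour described above.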




Figure 2: Dynamic Bayesian network representation of HHMMs. 



2.3 Conditional Random Fields 



Denote by $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ the graph where $\mathcal{V}$ is the
set of vertices, and $\mathcal{E}$ is the set of edges. Associated with each vertex $i$ is a
state variable $x_i$. Let $x$ be the joint state variable, i.e. $x = (x_i)_{i \in \mathcal{V}}$.
Conditional random fields (CRFs) (Lafferty et al., 2001) define a conditional distribution
given the observation $z$ as follows

$$ \Pr(x|z) = \frac{1}{Z(z)} \prod_c \phi_c(x_c, z) \tag{1} $$

where $c$ is the index of cliques in the graph, $\phi_c(x_c, z)$ is a non-negative potential
function defined over the clique $c$, and $Z(z) = \sum_x \prod_c \phi_c(x_c, z)$ is the
partition function.
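For a very small graph the definition can be evaluated by exhaustive enumeration, which is a useful sanity check before any efficient inference is attempted. The sketch below is illustrative only: the helper `crf_distribution`, the three-node chain and the toy potential are all assumptions, and the observation $z$ is taken to be absorbed into the potential function.

```python
import itertools
import math


def crf_distribution(n_nodes, n_states, cliques, potential):
    """Return {x: Pr(x|z)} by brute-force enumeration (tiny graphs only)."""
    scores = {}
    for x in itertools.product(range(n_states), repeat=n_nodes):
        # product of clique potentials for this joint assignment
        scores[x] = math.prod(potential(c, tuple(x[i] for i in c)) for c in cliques)
    Z = sum(scores.values())  # partition function
    return {x: s / Z for x, s in scores.items()}


# toy chain of three binary nodes; each pairwise clique favours equal neighbours
cliques = [(0, 1), (1, 2)]
potential = lambda c, xc: 2.0 if xc[0] == xc[1] else 1.0
p = crf_distribution(3, 2, cliques, potential)
print(p[(0, 0, 0)], sum(p.values()))
```

The enumeration is exponential in the number of nodes, which is why the clique marginals needed later are computed by dynamic programming instead.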

Let $\{\tilde{x}\}$ be the set of observed state variables with the empirical distribution
$\tilde{Q}(\tilde{x})$, and $w$ be the parameter vector. Learning in CRFs is typically by
maximising the (log) likelihood

$$ w^* = \arg\max_w \mathcal{L}(w) = \arg\max_w \sum_{\tilde{x}} \tilde{Q}(\tilde{x}) \log \Pr(\tilde{x}|z; w) \tag{2} $$

The gradient of the log-likelihood can be computed as

$$ \nabla\mathcal{L}(w) = \sum_{\tilde{x}} \tilde{Q}(\tilde{x}) \sum_c \Big( \nabla \log \phi_c(\tilde{x}_c, z) - \sum_{x_c} \Pr(x_c|z) \nabla \log \phi_c(x_c, z) \Big) \tag{3} $$

Thus, the inference needed in CRF parameter estimation is the computation of clique
marginals $\Pr(x_c|z)$.

Typically, CRFs are parameterised using log-linear models (also known as exponential
family, Gibbs distribution or Maximum Entropy model), i.e. $\phi_c(x_c, z) =
\exp(w^\top f(x_c, z))$, where $f(\cdot)$ is the feature vector and $w$ is the vector of
feature weights. The features are also known as sufficient statistics in the exponential
family setting. Let $F(x, z) = \sum_c f(x_c, z)$ be the global feature. Equation 3 can be
written as follows

$$ \nabla\mathcal{L} = \sum_{\tilde{x}} \tilde{Q}(\tilde{x}) \sum_c \Big( f(\tilde{x}_c, z) - \sum_{x_c} \Pr(x_c|z) f(x_c, z) \Big) \tag{4} $$

$$ \phantom{\nabla\mathcal{L}} = E_{\tilde{Q}(\tilde{x})}[F] - E_{\Pr(x|z)}[F] \tag{5} $$

Thus gradient-based maximum likelihood learning in the log-linear setting boils down
to estimating the feature expectations, also known as expected sufficient statistics (ESS).
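The identity "gradient equals observed features minus expected features" can be verified numerically on a toy model. The sketch below assumes a single training sequence, binary labels and two hypothetical global features; none of these choices come from the paper, and the expectation is computed by brute force rather than by the inference algorithms developed later.

```python
import itertools
import math


def features(x):
    # hypothetical global features F(x): number of 1-labels, number of equal neighbours
    return [sum(x), sum(a == b for a, b in zip(x, x[1:]))]


def score(w, x):
    return sum(wi * fi for wi, fi in zip(w, features(x)))


def log_likelihood(w, x_obs):
    configs = itertools.product((0, 1), repeat=len(x_obs))
    log_Z = math.log(sum(math.exp(score(w, x)) for x in configs))
    return score(w, x_obs) - log_Z


def gradient(w, x_obs):
    """Observed feature vector minus its expectation under the model."""
    configs = list(itertools.product((0, 1), repeat=len(x_obs)))
    weights = [math.exp(score(w, x)) for x in configs]
    Z = sum(weights)
    expected = [sum(wt * features(x)[k] for wt, x in zip(weights, configs)) / Z
                for k in range(len(w))]
    return [f - e for f, e in zip(features(x_obs), expected)]


# compare against central finite differences of the log-likelihood
w, x_obs, eps = [0.3, -0.2], (1, 1, 0, 1), 1e-6
analytic = gradient(w, x_obs)
numeric = []
for k in range(len(w)):
    up = [wi + eps * (i == k) for i, wi in enumerate(w)]
    down = [wi - eps * (i == k) for i, wi in enumerate(w)]
    numeric.append((log_likelihood(up, x_obs) - log_likelihood(down, x_obs)) / (2 * eps))
print(analytic, numeric)
```

The two vectors should agree up to finite-difference error, which is a convenient check for any implementation of the expected sufficient statistics.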

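The probabilistic nature of CRFs allows incorporating hidden variables in a disciplined
manner. Let $x = (\vartheta, h)$, where $\vartheta$ is the set of visible variables, and
$h$ is the set of hidden variables. The incomplete log-likelihood and its gradient are given as

$$ \mathcal{L} = \sum_{\tilde{\vartheta}} \tilde{Q}(\tilde{\vartheta}) \log \Pr(\tilde{\vartheta}|z) = \sum_{\tilde{\vartheta}} \tilde{Q}(\tilde{\vartheta}) \log \sum_h \Pr(\tilde{\vartheta}, h|z) $$

$$ \phantom{\mathcal{L}} = \sum_{\tilde{\vartheta}} \tilde{Q}(\tilde{\vartheta}) \big( \log Z(\tilde{\vartheta}, z) - \log Z(z) \big) \tag{6} $$

where $Z(\tilde{\vartheta}, z) = \sum_h \prod_c \phi_c(\tilde{\vartheta}_c, h_c, z)$. The gradient reads

$$ \nabla\mathcal{L} = E_{h|\tilde{\vartheta}, z}[F(\tilde{\vartheta}, h, z)] - E_{x|z}[F(x, z)] $$

$$ \phantom{\nabla\mathcal{L}} = \sum_{\tilde{\vartheta}} \tilde{Q}(\tilde{\vartheta}) \sum_c \Big( \sum_{h_c} \Pr(h_c|\tilde{\vartheta}, z) f(\tilde{\vartheta}_c, h_c, z) - \sum_{x_c} \Pr(x_c|z) f(x_c, z) \Big) \tag{7} $$

where the expectations in the first line are also averaged over the empirical distribution
$\tilde{Q}(\tilde{\vartheta})$.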

There have been extensions to CRFs, which can be broadly grouped into two
categories. The first category involves generalisation of model representation, for
example by extending CRFs for complex temporal structures as in Dynamic CRFs
(DCRFs) (Sutton et al., 2007), segmental sequences as in Semi-Markov CRFs
(SemiCRFs) (Sarawagi and Cohen, 2004), and relational data (Taskar et al., 2002). The
second category investigates learning schemes other than maximum likelihood, for
example perceptron (Collins, 2002) and SVM (Taskar et al., 2004).

DCRFs and SemiCRFs are the most closely related to our HSCRFs. DCRFs are
basically the conditional, undirected version of the Dynamic Bayesian Networks
(Murphy, 2002). The DCRFs introduce multiple levels of semantics, which help to represent
more complex sequential data. The main drawback of the DCRFs is the intractability of
inference, except for shallow models with small state space.

Similarly, the SemiCRFs are the conditional, undirected version of the Semi-Markov
HMMs. These allow non-Markovian processes to be embedded in the chain CRFs,
thus making it possible to model process duration. Appendix C analyses the
SemiCRFs in more detail.

Our HSCRFs deal with the inference problem of DCRFs by limiting themselves to
recursive processes, thus obtaining efficient inference via dynamic programming in
the Inside-Outside family of algorithms. Furthermore, they generalise the SemiCRFs
to model multiple levels of semantics. They also address partial labels by introducing
appropriate constraints to the Inside-Outside algorithms.



3 Model Definition of HSCRF 

Consider a hierarchically nested Markov process with $D$ levels. Then as in the HHMMs
(see Section 2.2), the parent state embeds a child Markov chain whose states may in
turn contain child Markov chains. The family relation is defined in the model topology,
which is a state hierarchy of depth $D$. The model has a set of states $S^d$ at each level
$d \in [1, D]$, i.e. $S^d = \{1, \ldots, |S^d|\}$, where $|S^d|$ is the number of states at
level $d$. For each state $s^d \in S^d$ where $1 \le d < D$, the topological structure also
defines a set of children $ch(s^d) \subset S^{d+1}$. Conversely, each child $s^{d+1}$ has a
set of parents $pa(s^{d+1}) \subset S^d$.
Unlike the original HHMMs where the child states belong exclusively to the parent,
the HSCRFs allow arbitrary sharing of children between parents. For example, in
Figure 3, $ch(s^1 = 1) = \{1, 2, 3\}$, and $pa(s^3 = 1) = \{1, 2, 4\}$. This helps to avoid
an explosive number of sub-states when $D$ is large, leading to fewer parameters and
possibly less training data and time. The shared topology has been investigated in the
context of HHMMs in (Bui et al., 2004).
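One simple way to hold such a shared topology in code is a per-level mapping from each state to its set of children, from which the parent sets follow by inversion. The sketch below is hypothetical: only $ch(s^1 = 1) = \{1, 2, 3\}$ and $pa(s^3 = 1) = \{1, 2, 4\}$ are taken from the example above, while the remaining entries and the helper `parents` are invented for illustration.

```python
from collections import defaultdict

# children[d][s] = set of level-(d+1) states that state s at level d may initialise.
# Only ch(s^1 = 1) and pa(s^3 = 1) follow the text; other entries are made up.
children = {
    1: {1: {1, 2, 3}, 2: {3, 4}},
    2: {1: {1, 2}, 2: {1, 3}, 3: {2, 3}, 4: {1, 3}},
}


def parents(children_at_level):
    """Invert a child map into a parent map for the level below."""
    pa = defaultdict(set)
    for s, kids in children_at_level.items():
        for u in kids:
            pa[u].add(s)
    return dict(pa)


print(children[1][1])           # children of state 1 at level 1
print(parents(children[2])[1])  # parents of state 1 at level 3
```

Because children are shared, the number of distinct states at the lower level does not have to grow with the number of parents.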

The temporal evolution in the nested Markov processes with sequence length of T 
operates as follows: 

• As soon as a state is created at level d < D, it initialises a child state at level 
d + 1. The initialisation continues downward until reaching the bottom level. 

• As soon as a child process at level $d + 1$ ends, it returns control to its parent at
level $d$, and in the case of $d > 1$, the parent either transits to a new parent state
or returns to the grand-parent at level $d - 1$.

The main requirement for the hierarchical nesting is that the life-span of the child
process belongs exclusively to the life-span of the parent. For example, consider a
parent process at level $d$ that starts a new state $s^d_{i:j}$ at time $i$ and persists until
time $j$. At time $i$ the parent initialises a child state $s^{d+1}_i$ which continues until it
ends at time $k < j$, at which the child state transits to a new child state $s^{d+1}_{k+1}$.
The child process exits at time $j$, at which the control from the child level is returned to
the parent $s^d_{i:j}$. Upon receiving the control the parent state $s^d_{i:j}$ may transit to a
new parent state $s^d_{j+1:l}$, or end at $j$, returning the control to the grand-parent at
level $d - 1$.




Figure 3: The shared topological structure.

Figure 4: The multi-level temporal model.

We are now in a position to specify the nested Markov processes in a more formal
way. Let us introduce a multi-level temporal graphical model of length $T$ with $D$ levels,
starting from the top as 1 and the bottom as $D$ (Figure 4). At each level $d \in [1, D]$
and time index $i \in [1, T]$ there is a node representing a state variable
$x_i^d \in S^d = \{1, 2, \ldots, |S^d|\}$. Associated with each $x_i^d$ is an ending indicator
$e_i^d$ which can be either 1 or 0 to signify whether the state $x_i^d$ ends or persists at $i$.
The nesting nature of the HSCRFs is now realised by imposing the specific constraints on
the value assignment of ending indicators (Figure 5).

Thus, specific value assignments of ending indicators provide contexts that realise 
the evolution of the model states in both hierarchical (vertical) and temporal (horizon- 
tal) directions. Each context at a level and associated state variables form a contextual 
clique, and we identify four contextual clique types: 

• State-persistence: This corresponds to the life time of a state at a given level
(see Figure 6). Specifically, given a context $c = (e^d_{i-1:j} = (1, 0, \ldots, 0, 1))$, then
$\sigma^{persist,d}_{i:j} = (x^d_{i:j}, c)$ is a contextual clique that specifies the life-span
$[i, j]$ of any state $s = x^d_{i:j}$.

• The top state persists during the course of evolution, i.e. $e^1_{1:T-1} = 0$.

• When a state finishes, all of its descendants must also finish,
i.e. $e_i^d = 1$ implies $e_i^{d+1:D} = 1$.

• When a state persists, all of its ancestors must also persist,
i.e. $e_i^d = 0$ implies $e_i^{1:d-1} = 0$.

• When a state transits, its parent must remain unchanged, i.e. $e_i^d = 1$, $e_i^{d-1} = 0$.

• The bottom states do not persist, i.e. $e_i^D = 1$ for all $i \in [1, T]$.

• All states end at $T$, i.e. $e_T^{1:D} = 1$.

Figure 5: Hierarchical constraints.



• State-transition: This corresponds to a state at level $d \in [2, D]$ at time $i$
transiting to a new state (see Figure 7, left). Specifically, given a context
$c = (e_i^{d-1} = 0, e_i^d = 1)$ then $\sigma_i^{transit,d} = (x^{d-1}_{i+1}, x^d_{i:i+1}, c)$
is a contextual clique that specifies the transition of $x_i^d$ to $x_{i+1}^d$ at time $i$
under the same parent $x^{d-1}_{i+1}$.

• State-initialisation: This corresponds to a state at level $d \in [1, D - 1]$ initialising
a new child state at level $d + 1$ at time $i$ (see Figure 7, middle). Specifically, given a
context $c = (e^d_{i-1} = 1)$, then $\sigma_i^{init,d} = (x_i^d, x_i^{d+1}, c)$ is a contextual
clique that specifies the initialisation at time $i$ from the parent $x_i^d$ to the child
$x_i^{d+1}$.

• State-ending: This corresponds to a state at level $d \in [1, D - 1]$ ending at time
$i$ (see Figure 7, right). Specifically, given a context $c = (e_i^d = 1)$, then
$\sigma_i^{end,d} = (x_i^d, x_i^{d+1}, c)$ is a contextual clique that specifies the ending of
$x_i^d$ at time $i$ with the last child $x_i^{d+1}$.
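The hierarchical constraints of Figure 5 can be checked directly on a full grid of ending indicators. The sketch below is an illustration under assumed conventions (a list of rows, top level first, 0-based indices); the names `is_valid` and `count_valid` are hypothetical. The transition item in Figure 5 describes a context rather than a restriction, so it is not tested separately.

```python
import itertools


def is_valid(e):
    """e[d][i] is the ending indicator at level d+1 and time i+1 (0-based lists)."""
    D, T = len(e), len(e[0])
    if any(e[0][i] for i in range(T - 1)):       # the top state persists before T
        return False
    if not all(e[D - 1]):                        # bottom states never persist
        return False
    if not all(e[d][T - 1] for d in range(D)):   # every state ends at T
        return False
    for i in range(T):
        for d in range(D - 1):
            # an ended state must not have a persisting child; checking adjacent
            # levels also covers "a persisting state has persisting ancestors"
            if e[d][i] == 1 and e[d + 1][i] == 0:
                return False
    return True


def count_valid(D, T):
    """Count valid indicator grids by brute force (tiny D and T only)."""
    total = 0
    for bits in itertools.product((0, 1), repeat=D * T):
        e = [bits[d * T:(d + 1) * T] for d in range(D)]
        total += is_valid(e)
    return total


print(count_valid(3, 3), count_valid(4, 3))
```

For these small cases the count agrees with $(D-1)^{T-1}$: before time $T$ each column is a run of zeros followed by a run of ones, with the top entry 0 and the bottom entry 1, so only the position of the first 1 is free.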






Figure 6: An example of a state-persistence sub-graph.

Figure 7: Sub-graphs for state transition (left), initialisation (middle) and ending
(right).

In the HSCRF we are interested in the conditional setting in which the entire
state variables and ending indicators $(x^{1:D}_{1:T}, e^{1:D}_{1:T})$ are conditioned on
observational sequences $z$. For example, in computational linguistics, the observation is
often the sequence of words and the state variables might be the part-of-speech tags and the
phrases.

To capture the correlation between variables and such conditioning, we define a
non-negative potential function $\psi(\sigma, z)$ over each contextual clique $\sigma$.
Figure 8 shows the notations for potentials that correspond to the four contextual clique
types we have identified above. Details of potential specification are described in
Section 6.1.

• $R^{d,s,z}_{i:j} = \psi(\sigma^{persist,d}_{i:j}, z)$ where $s = x^d_{i:j}$.

• $A^{d,s,z}_{u,v,i} = \psi(\sigma^{transit,d}_i, z)$ where $s = x^{d-1}_{i+1}$ and $u = x^d_i$, $v = x^d_{i+1}$.

• $\pi^{d,s,z}_{u,i} = \psi(\sigma^{init,d}_i, z)$ where $s = x^d_i$, $u = x^{d+1}_i$.

• $E^{d,s,z}_{u,i} = \psi(\sigma^{end,d}_i, z)$ where $s = x^d_i$, $u = x^{d+1}_i$.



Figure 8: Shorthands for contextual clique potentials. 



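Let $\zeta = (x^{1:D}_{1:T}, e^{1:D}_{1:T})$ denote the set of all variables that satisfies
the set of hierarchical constraints in Figure 5. Let $\tau^d$ denote the ordered set of all
ending time indices at level $d$, i.e. if $i \in \tau^d$ then $e_i^d = 1$. The joint potential
defined for each configuration is the product of all contextual clique potentials over all
ending time indices $i \in [1, T]$ and all semantic levels $d \in [1, D]$:

$$ \Phi[\zeta, z] = \Big[ \prod_{d \in [1,D]} \prod_{i_k, i_{k+1} \in \tau^d} R^{d,s,z}_{i_k+1:i_{k+1}} \Big] \times \prod_{d \in [1,D-1]} \Big\{ \Big[ \prod_{i_k \in \tau^{d+1},\, i_k \notin \tau^d} A^{d+1,s,z}_{u,v,i_k} \Big] \Big[ \prod_{i_k \in \tau^d} \pi^{d,s,z}_{u,i_k+1} \Big] \Big[ \prod_{i_k \in \tau^d} E^{d,s,z}_{u,i_k} \Big] \Big\} \tag{8} $$

where $i_k$ and $i_{k+1}$ are consecutive ending indices at the level concerned.

The conditional distribution is given as

$$ \Pr(\zeta|z) = \frac{1}{Z(z)} \Phi[\zeta, z] \tag{9} $$

where $Z(z) = \sum_{\zeta} \Phi[\zeta, z]$ is the partition function for normalisation.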
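The contextual cliques that enter the joint potential are fully determined by the ending indicators, so they can be listed before any potential is evaluated. The sketch below is an illustration only: the function `contextual_cliques`, the tuple encoding and the treatment of time 0 as "already ended" (so that every level is initialised at time 1) are assumptions, not the paper's implementation.

```python
def contextual_cliques(e):
    """List the contextual cliques implied by a grid of ending indicators.

    e[d][i] is the indicator at level d+1 and time i+1 (0-based lists); the
    returned tuples use 1-based levels and times, as in the text.
    """
    D, T = len(e), len(e[0])
    cliques = []
    for d in range(D):
        start = 1
        for i in range(1, T + 1):
            if e[d][i - 1] == 1:
                cliques.append(("persist", d + 1, start, i))  # life-span [start, i]
                start = i + 1
    for d in range(D - 1):  # parent at level d+1, child at level d+2
        for i in range(1, T + 1):
            # assumption: the (virtual) state before time 1 counts as ended
            if i == 1 or e[d][i - 2] == 1:
                cliques.append(("init", d + 1, i))
            if e[d][i - 1] == 1:
                cliques.append(("end", d + 1, i))
            elif e[d + 1][i - 1] == 1:
                cliques.append(("transit", d + 2, i))  # child changes, parent persists
    return cliques


e = [[0, 0, 0, 1],
     [0, 1, 0, 1],
     [1, 1, 1, 1]]
for clique in contextual_cliques(e):
    print(clique)
```

Multiplying one potential per listed clique, with the state labels filled in, gives the unnormalised score of that configuration.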

In what follows we omit $z$ for clarity, and implicitly use it as part of the partition
function $Z$ and the potential $\Phi[\cdot]$. It should be noted that in the unconditional
formulation, there is only a single $Z$ for all data instances. In the conditional setting
there is a $Z(z)$ for each data instance $z$.



Remarks: The temporal model of HSCRFs presented here is not a standard graphical
model (Lauritzen, 1996) since the connectivity (and therefore the clique structures)
is not fixed. The potentials are defined on-the-fly depending on the context of
assignments of ending indicators. Although the model topology is identical to that of shared
structure HHMMs (Bui et al., 2004), the unrolled temporal representation is an
undirected graph and the model distribution is formulated in a discriminative way.
Furthermore, the state persistence potentials capture duration information that is not
available in the dynamic DBN representation of the HHMMs in (Murphy and Paskin, 2002).
In the way the potentials are introduced it may first appear to resemble the clique
templates in the discriminative relational Markov networks (RMNs) (Taskar et al., 2002).
It is, however, different because cliques in the HSCRFs are dynamic and
context-dependent.



4 Asymmetric Inside-Outside Algorithm 



This section describes a core inference engine called the Asymmetric Inside-Outside (AIO)
algorithm, which is partly adapted from the generative, directed counterpart of
HHMMs in (Bui et al., 2004). We now show how to compute the building blocks that are
needed in most inference and learning tasks.



4.1 Building Blocks and Conditional Independence 






Figure 9: (a) Symmetric Markov blanket, and (b) Asymmetric Markov blanket. 



4.1.1 Contextual Markov blankets 

In this subsection we define elements that are building blocks for inference and learn- 
ing. These building blocks are identified given the corresponding boundaries. Let us 
introduce two types of boundaries: the contextual symmetric and asymmetric Markov 
blankets. 

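Definition 1. A symmetric Markov blanket at level $d$ for a state $s$ starting at $i$ and
ending at $j$ is the following set

$$ \Pi^{d,s}_{i:j} = \big( x^d_{i:j} = s,\; e^{d:D}_{i-1} = 1,\; e^{d:D}_j = 1,\; e^d_{i:j-1} = 0 \big) \tag{10} $$

Definition 2. Let $\Pi^{d,s}_{i:j}$ be a symmetric Markov blanket, we define
$\zeta^{d,s}_{i:j}$ and $\underline{\zeta}^{d,s}_{i:j}$ as follows

$$ \zeta^{d,s}_{i:j} = \big( x^{d+1:D}_{i:j},\; e^{d+1:D}_{i:j-1} \big) \tag{11} $$

$$ \underline{\zeta}^{d,s}_{i:j} = \zeta \setminus \big( \zeta^{d,s}_{i:j}, \Pi^{d,s}_{i:j} \big) \tag{12} $$

subject to $x^d_{i:j} = s$. Further, we define

$$ \hat{\zeta}^{d,s}_{i:j} = \big( \zeta^{d,s}_{i:j}, \Pi^{d,s}_{i:j} \big) \tag{13} $$

$$ \hat{\underline{\zeta}}^{d,s}_{i:j} = \big( \underline{\zeta}^{d,s}_{i:j}, \Pi^{d,s}_{i:j} \big) \tag{14} $$

Figure 9a shows an example of a symmetric Markov blanket (represented by a
double-arrowed line).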

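Definition 3. An asymmetric Markov blanket at level $d$ for a parent state $s$ starting at $i$
and a child state $u$ ending at $j$ is the following set

$$ \Gamma^{d,s}_{i:j}(u) = \big( x^d_{i:j} = s,\; x^{d+1}_j = u,\; e^{d:D}_{i-1} = 1,\; e^{d+1:D}_j = 1,\; e^d_{i:j-1} = 0 \big) \tag{15} $$

Definition 4. Let $\Gamma^{d,s}_{i:j}(u)$ be an asymmetric Markov blanket, we define
$\zeta^{d,s}_{i:j}(u)$ and $\underline{\zeta}^{d,s}_{i:j}(u)$ as follows

$$ \zeta^{d,s}_{i:j}(u) = \big( x^{d+1:D}_{i:j-1},\; x^{d+2:D}_j,\; e^{d+1:D}_{i:j-1} \big) \tag{16} $$

$$ \underline{\zeta}^{d,s}_{i:j}(u) = \zeta \setminus \big( \zeta^{d,s}_{i:j}(u), \Gamma^{d,s}_{i:j}(u) \big) \tag{17} $$

subject to $x^d_{i:j} = s$ and $x^{d+1}_j = u$. Further, we define

$$ \hat{\zeta}^{d,s}_{i:j}(u) = \big( \zeta^{d,s}_{i:j}(u), \Gamma^{d,s}_{i:j}(u) \big) \tag{18} $$

$$ \hat{\underline{\zeta}}^{d,s}_{i:j}(u) = \big( \underline{\zeta}^{d,s}_{i:j}(u), \Gamma^{d,s}_{i:j}(u) \big) \tag{19} $$

Figure 9b shows an example of an asymmetric Markov blanket (represented by an
arrowed line).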

Remark: The concepts of contextual Markov blankets (or Markov blankets for 
short) are different from those in traditional Markov random fields and Bayesian net- 
works because they are specific assignments of a subset of variables, rather than a 
collection of variables. 

4.1.2 Conditional independence 

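Given these two definitions we have the following propositions of conditional
independence.

Proposition 1. $\zeta^{d,s}_{i:j}$ and $\underline{\zeta}^{d,s}_{i:j}$ are conditionally
independent given $\Pi^{d,s}_{i:j}$

$$ \Pr(\zeta^{d,s}_{i:j}, \underline{\zeta}^{d,s}_{i:j} \,|\, \Pi^{d,s}_{i:j}) = \Pr(\zeta^{d,s}_{i:j} \,|\, \Pi^{d,s}_{i:j}) \Pr(\underline{\zeta}^{d,s}_{i:j} \,|\, \Pi^{d,s}_{i:j}) \tag{20} $$

This proposition gives rise to the following factorisation

$$ \Pr(\zeta) = \Pr(\Pi^{d,s}_{i:j}) \Pr(\zeta^{d,s}_{i:j}, \underline{\zeta}^{d,s}_{i:j} \,|\, \Pi^{d,s}_{i:j}) = \Pr(\Pi^{d,s}_{i:j}) \Pr(\zeta^{d,s}_{i:j} \,|\, \Pi^{d,s}_{i:j}) \Pr(\underline{\zeta}^{d,s}_{i:j} \,|\, \Pi^{d,s}_{i:j}) \tag{21} $$

Proposition 2. $\zeta^{d,s}_{i:j}(u)$ and $\underline{\zeta}^{d,s}_{i:j}(u)$ are conditionally
independent given $\Gamma^{d,s}_{i:j}(u)$

$$ \Pr(\zeta^{d,s}_{i:j}(u), \underline{\zeta}^{d,s}_{i:j}(u) \,|\, \Gamma^{d,s}_{i:j}(u)) = \Pr(\zeta^{d,s}_{i:j}(u) \,|\, \Gamma^{d,s}_{i:j}(u)) \Pr(\underline{\zeta}^{d,s}_{i:j}(u) \,|\, \Gamma^{d,s}_{i:j}(u)) \tag{22} $$

The following factorisation is a consequence of Proposition 2

$$ \Pr(\zeta) = \Pr(\Gamma^{d,s}_{i:j}(u)) \Pr(\zeta^{d,s}_{i:j}(u), \underline{\zeta}^{d,s}_{i:j}(u) \,|\, \Gamma^{d,s}_{i:j}(u)) $$

$$ \phantom{\Pr(\zeta)} = \Pr(\Gamma^{d,s}_{i:j}(u)) \Pr(\zeta^{d,s}_{i:j}(u) \,|\, \Gamma^{d,s}_{i:j}(u)) \Pr(\underline{\zeta}^{d,s}_{i:j}(u) \,|\, \Gamma^{d,s}_{i:j}(u)) \tag{23} $$

The proof of Propositions 1 and 2 is given in Appendix A.1.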

4.1.3 Symmetric Inside/Outside Masses 

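From Equation 12 we have $\zeta = (\zeta^{d,s}_{i:j}, \Pi^{d,s}_{i:j}, \underline{\zeta}^{d,s}_{i:j})$.
Since $\Pi^{d,s}_{i:j}$ separates $\zeta^{d,s}_{i:j}$ from $\underline{\zeta}^{d,s}_{i:j}$, we
can group local potentials in Equation 8 into three parts: $\Phi[\hat{\zeta}^{d,s}_{i:j}]$,
$\Phi[\hat{\underline{\zeta}}^{d,s}_{i:j}]$ and $\Phi[\Pi^{d,s}_{i:j}]$. By 'grouping' we mean
to multiply all the local potentials belonging to a certain part, in the same way that we
group all the local potentials belonging to the model in Equation 8. Note that although
$\hat{\zeta}^{d,s}_{i:j}$ contains $\Pi^{d,s}_{i:j}$ we do not group $\Phi[\Pi^{d,s}_{i:j}]$
into $\Phi[\hat{\zeta}^{d,s}_{i:j}]$. The same holds for $\Phi[\hat{\underline{\zeta}}^{d,s}_{i:j}]$.

By definition of the state-persistence clique potential (Figure 8), we have
$\Phi[\Pi^{d,s}_{i:j}] = R^{d,s}_{i:j}$. Thus Equation 8 can be replaced by

$$ \Phi[\zeta] = \Phi[\hat{\zeta}^{d,s}_{i:j}] \, R^{d,s}_{i:j} \, \Phi[\hat{\underline{\zeta}}^{d,s}_{i:j}] \tag{24} $$

There are two special cases: (1) when $d = 1$, $\Phi[\hat{\underline{\zeta}}^{1,s}_{1:T}] = 1$
for $s \in S^1$, and (2) when $d = D$, $\Phi[\hat{\zeta}^{D,s}_{i:i}] = 1$ for $s \in S^D$ and
$i \in [1, T]$. This factorisation plays an important role in efficient inference.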

We now define a quantity called the symmetric inside mass $\Delta^{d,s}_{i:j}$, and another
called the symmetric outside mass $\Lambda^{d,s}_{i:j}$.

Definition 5. Given a symmetric Markov blanket $\Pi^{d,s}_{i:j}$, the symmetric inside mass
$\Delta^{d,s}_{i:j}$ and the symmetric outside mass $\Lambda^{d,s}_{i:j}$ are defined as

$$ \Delta^{d,s}_{i:j} = \sum_{\zeta^{d,s}_{i:j}} \Phi[\hat{\zeta}^{d,s}_{i:j}] \tag{25} $$

$$ \Lambda^{d,s}_{i:j} = \sum_{\underline{\zeta}^{d,s}_{i:j}} \Phi[\hat{\underline{\zeta}}^{d,s}_{i:j}] \tag{26} $$

As special cases we have $\Lambda^{1,s}_{1:T} = 1$ for $s \in S^1$, and
$\Delta^{D,s}_{i:i} = 1$ for $i \in [1, T]$, $s \in S^D$. For later use let us introduce the
'full' symmetric inside mass $\hat{\Delta}^{d,s}_{i:j}$ and the 'full' symmetric outside mass
$\hat{\Lambda}^{d,s}_{i:j}$ as

$$ \hat{\Delta}^{d,s}_{i:j} = R^{d,s}_{i:j} \Delta^{d,s}_{i:j} \tag{27} $$

$$ \hat{\Lambda}^{d,s}_{i:j} = R^{d,s}_{i:j} \Lambda^{d,s}_{i:j} \tag{28} $$

In the rest of the paper, when it is clear in the context, we will use inside mass as
a shorthand for symmetric inside mass, outside mass for symmetric outside mass,
full-inside mass for full-symmetric inside mass, and full-outside mass for full-symmetric
outside mass.

Thus, from Equation|24]the partition function can be computed from the full-inside 
mass at the top level (d = 1) 

z = 

c 

^■X-.T^X-.T 

ses 1 



= E a 1;t (29) 

ses 1 

With the similar derivation the partition function can also be computed from the full- 
outside mass at the bottom level (d = D) 

Z = A -'"- f°r any iG M (30) 
In fact, we will prove a more general way to compute Z in AppendixlBl 

z = E E E ^-T-OC (3D 

ses d ie[x,t] je[t,T] 

for any t G [1, T] and d G [2, £> — 1]. These relations are summarised in Figure [TOl 
•^ = E seS « A^ s foran yi G [1,T] 

• g = E sg s d EigTM E je[t ,Tl A^A^jg for any t G [1, T] and d G [2, £> - 1] 



Figure 10: Computing the partition function from the full-inside mass and full-outside 
mass. 

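Given the fact that $\zeta^{d,s}_{i:j}$ is separated from the rest of the variables by the symmetric Markov blanket $\Pi^{d,s}_{i:j}$, we have Proposition 3.

Proposition 3. The following relations hold

$$\Pr(\zeta^{d,s}_{i:j} \mid \Pi^{d,s}_{i:j}) = \frac{1}{\Delta^{d,s}_{i:j}} \Phi[\hat{\zeta}^{d,s}_{i:j}] \tag{32}$$

$$\Pr(\underline{\zeta}^{d,s}_{i:j} \mid \Pi^{d,s}_{i:j}) = \frac{1}{\Lambda^{d,s}_{i:j}} \Phi[\hat{\underline{\zeta}}^{d,s}_{i:j}] \tag{33}$$

$$\Pr(\Pi^{d,s}_{i:j}) = \frac{1}{Z} \Delta^{d,s}_{i:j} R^{d,s}_{i:j} \Lambda^{d,s}_{i:j} \tag{34}$$

The proof of this proposition is given in Appendix A.2.
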
4.1.4 Asymmetric Inside/Outside Masses 

Recall that we have introduced the concept of the asymmetric Markov blanket $\Gamma^{d,s}_{i:j}(u)$, which separates $\zeta^{d,s}_{i:j}(u)$ and $\underline{\zeta}^{d,s}_{i:j}(u)$. Let us group all the local contextual clique potentials associated with $\zeta^{d,s}_{i:j}(u)$ and $\Gamma^{d,s}_{i:j}(u)$ into a joint potential $\Phi[\hat{\zeta}^{d,s}_{i:j}(u)]$. Similarly, we group all local potentials associated with $\underline{\zeta}^{d,s}_{i:j}(u)$ and $\Gamma^{d,s}_{i:j}(u)$ into a joint potential $\Phi[\hat{\underline{\zeta}}^{d,s}_{i:j}(u)]$. Note that $\Phi[\hat{\underline{\zeta}}^{d,s}_{i:j}(u)]$ includes the state-persistence potential $R^{d,s}_{i:j}$.

Definition 6. Given the asymmetric Markov blanket $\Gamma^{d,s}_{i:j}(u)$, the asymmetric inside mass $\alpha^{d,s}_{i:j}(u)$ and the asymmetric outside mass $\lambda^{d,s}_{i:j}(u)$ are defined as follows

$$\alpha^{d,s}_{i:j}(u) = \sum_{\zeta^{d,s}_{i:j}(u)} \Phi[\hat{\zeta}^{d,s}_{i:j}(u)] \tag{35}$$

$$\lambda^{d,s}_{i:j}(u) = \sum_{\underline{\zeta}^{d,s}_{i:j}(u)} \Phi[\hat{\underline{\zeta}}^{d,s}_{i:j}(u)] \tag{36}$$

The relationship between the asymmetric outside mass and the asymmetric inside mass is analogous to that between the outside and inside masses. However, there is a small difference: the asymmetric outside mass 'owns' the segment $x^d_{i:j} = s$ and the associated state-persistence potential $R^{d,s}_{i:j}$, whilst the outside mass $\Lambda^{d,s}_{i:j}$ does not.

4.2 Computing Inside Masses 

In this subsection we show how to recursively compute the pair: inside mass and asymmetric inside mass. The key idea here is to exploit the decomposition within the asymmetric Markov blanket. As shown in Figure 11, an outer asymmetric Markov blanket can be decomposed into a sub-asymmetric Markov blanket and a symmetric blanket.

4.2.1 Computing asymmetric inside mass from inside mass 

Assume that within the asymmetric Markov blanket $\Gamma^{d,s}_{i:j}(u)$, the child $u$ starts somewhere at $t \in [i, j]$ and ends at $j$, i.e. $x^{d+1}_{t:j} = u$, $e^{d+1}_{t:j-1} = 0$ and $e^{d+1:D}_{t-1} = 1$. Let us consider two cases: $t > i$ and $t = i$.

Figure 11: Decomposition with respect to symmetric/asymmetric Markov blankets.

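Case 1. For $t > i$, denote $v = x^{d+1}_{t-1}$. We have two smaller blankets within $\Gamma^{d,s}_{i:j}(u)$: the symmetric blanket $\Pi^{d+1,u}_{t:j}$ associated with the child $u = x^{d+1}_{t:j}$, and the asymmetric blanket $\Gamma^{d,s}_{i:t-1}(v)$ associated with the child $v$ ending at $t-1$ under the parent $s$. Figure 11 illustrates the blanket decomposition. The assignment $\zeta^{d,s}_{i:j}(u)$ can be decomposed as

$$\zeta^{d,s}_{i:j}(u) = \big(\zeta^{d,s}_{i:t-1}(v), \zeta^{d+1,u}_{t:j}, u = x^{d+1}_{t:j}, e^{d}_{t-1:j-1} = 0, e^{d+1:D}_{t-1} = 1\big) \tag{37}$$

Thus, the joint potential $\Phi[\hat{\zeta}^{d,s}_{i:j}(u)]$ can be factorised as follows

$$\Phi[\hat{\zeta}^{d,s}_{i:j}(u)] = \Phi[\hat{\zeta}^{d,s}_{i:t-1}(v)] \, \Phi[\hat{\zeta}^{d+1,u}_{t:j}] \, R^{d+1,u}_{t:j} \, A^{d+1,s}_{v,u,t-1} \tag{38}$$

The transition potential $A^{d+1,s}_{v,u,t-1}$ is enabled in the context $c = (e^{d}_{t-1} = 0, e^{d+1}_{t-1} = 1, x^{d}_{t} = s, x^{d+1}_{t-1} = v, x^{d+1}_{t} = u)$, and the state-persistence potential $R^{d+1,u}_{t:j}$ in the context $c = (e^{d+1}_{t:j-1} = 0, e^{d+1:D}_{t-1} = 1, e^{d+1:D}_{j} = 1, x^{d+1}_{t:j} = u)$.

Case 2. For $t = i$, the asymmetric blanket $\Gamma^{d,s}_{i:t-1}(v)$ does not exist since $i > t-1$. We have the following decomposition of the assignment: $\zeta^{d,s}_{i:j}(u) = (\zeta^{d+1,u}_{i:j}, e^{d}_{i-1} = 1, e^{d}_{i:j-1} = 0)$. In the context $c = (e^{d}_{i-1} = 1)$, the state-initialisation potential $\pi^{d,s}_{u,i}$ is activated. Thus we have

$$\Phi[\hat{\zeta}^{d,s}_{i:j}(u)] = \Phi[\hat{\zeta}^{d+1,u}_{i:j}] \, R^{d+1,u}_{i:j} \, \pi^{d,s}_{u,i} \tag{39}$$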

Substituting Equations 38 and 39 into Equation 35, and together with the fact that $t$ can take any value in the interval $[i, j]$ and $v$ can take any value in $S^{d+1}$, we have the following relation

$$\begin{aligned}
\alpha^{d,s}_{i:j}(u) &= \sum_{t \in [i+1,j]} \sum_{v \in S^{d+1}} \sum_{\zeta^{d,s}_{i:t-1}(v)} \sum_{\zeta^{d+1,u}_{t:j}} \Phi[\hat{\zeta}^{d,s}_{i:t-1}(v)] \, \Phi[\hat{\zeta}^{d+1,u}_{t:j}] \, R^{d+1,u}_{t:j} A^{d+1,s}_{v,u,t-1} \\
&\quad + \sum_{\zeta^{d+1,u}_{i:j}} \Phi[\hat{\zeta}^{d+1,u}_{i:j}] \, R^{d+1,u}_{i:j} \pi^{d,s}_{u,i} \\
&= \sum_{t \in [i+1,j]} \sum_{v \in S^{d+1}} \alpha^{d,s}_{i:t-1}(v) \, \hat{\Delta}^{d+1,u}_{t:j} A^{d+1,s}_{v,u,t-1} + \hat{\Delta}^{d+1,u}_{i:j} \pi^{d,s}_{u,i}
\end{aligned} \tag{40}$$

As we can see, the asymmetric inside mass $\alpha$ plays the role of a forward message passed from the starting time $i$ to the ending time $j$. There is a recursion in which the asymmetric inside mass ending at time $j$ is computed from all the asymmetric inside masses ending at time $t-1$, for $t \in [i+1, j]$.
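
The recursion in Equation 40, followed by the closing step of Equation 44, can be sketched for a single parent segment as below. This is a minimal illustration and not the authors' implementation: it assumes the full inside masses of the children are already available in a table `dhat`, drops the time index on the transition, initialisation and ending potentials, and uses 0-based positions with the parent starting at position 0. All function and variable names are hypothetical. A brute-force enumeration over labelled segmentations is included only as a check.

```python
from itertools import product

def asymmetric_inside(n, K, dhat, A, pi):
    """alpha[j][u] for a parent that starts at position 0 (simplified Eq. 40).

    dhat[t][j][u]: full inside mass of child u occupying [t, j] (assumed given)
    A[v][u]:       child-to-child transition potential (time index dropped)
    pi[u]:         initialisation potential of child u (time index dropped)
    """
    alpha = [[0.0] * K for _ in range(n)]
    for j in range(n):
        for u in range(K):
            # second term of Eq. 40: u is the first child and starts at i = 0
            total = dhat[0][j][u] * pi[u]
            # first term: u starts at t > i, preceded by some child v ending at t - 1
            for t in range(1, j + 1):
                for v in range(K):
                    total += alpha[t - 1][v] * dhat[t][j][u] * A[v][u]
            alpha[j][u] = total
    return alpha

def inside_mass(alpha_j, E):
    """Simplified Eq. 44: close the parent by letting the last child u end."""
    return sum(E[u] * a for u, a in enumerate(alpha_j))

def brute_force_inside(n, K, dhat, A, pi, E):
    """Enumerate every labelled segmentation of [0, n-1]; for checking only."""
    total = 0.0
    for cuts in product([0, 1], repeat=n - 1):   # cuts[t] = 1: a child ends at t
        ends = [t for t in range(n - 1) if cuts[t]] + [n - 1]
        starts = [0] + [e + 1 for e in ends[:-1]]
        for labels in product(range(K), repeat=len(ends)):
            w = pi[labels[0]] * E[labels[-1]]
            for k, (a, b) in enumerate(zip(starts, ends)):
                w *= dhat[a][b][labels[k]]
                if k > 0:
                    w *= A[labels[k - 1]][labels[k]]
            total += w
    return total

# Hypothetical potentials for a parent of length 4 with 2 child states.
n, K = 4, 2
dhat = [[[1.0 + 0.1 * u + 0.05 * (j - t) for u in range(K)]
         for j in range(n)] for t in range(n)]
A = [[0.5, 1.5], [2.0, 1.0]]
pi = [1.0, 0.7]
E = [0.9, 1.2]
alpha = asymmetric_inside(n, K, dhat, A, pi)
delta = inside_mass(alpha[n - 1], E)
```

The dynamic programme touches each triple (t, j, v) once per child u, whereas the enumeration grows exponentially with the segment length; this is the saving that the forward-message view provides.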

There are special cases for the asymmetric inside mass: (1) when $i = j$, we only have

$$\alpha^{d,s}_{i:i}(u) = \hat{\Delta}^{d+1,u}_{i:i} \pi^{d,s}_{u,i} \tag{41}$$

and (2) when $d = D-1$, the sum over the index $t$ as in Equation 40 is not allowed since at level $D$ the inside mass only spans a single index. We have the following instead

$$\begin{aligned}
\alpha^{D-1,s}_{i:j}(u) &= \sum_{v \in S^{D}} \alpha^{D-1,s}_{i:j-1}(v) \, \hat{\Delta}^{D,u}_{j:j} A^{D,s}_{v,u,j-1} \\
&= \sum_{v \in S^{D}} \alpha^{D-1,s}_{i:j-1}(v) \, R^{D,u}_{j:j} A^{D,s}_{v,u,j-1}
\end{aligned} \tag{42}$$

4.2.2 Computing inside mass from asymmetric inside mass 

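Notice the relationship between the asymmetric Markov blanket $\Gamma^{d,s}_{i:j}(u)$ and the symmetric blanket $\Pi^{d,s}_{i:j}$, where $d < D$. When $e^{d}_{j} = 1$, i.e. the parent $s$ ends at $j$, $\Gamma^{d,s}_{i:j}(u)$ will become $\Pi^{d,s}_{i:j}$ with $u = x^{d+1}_{j}$. Then we have the decompositions $\zeta^{d,s}_{i:j} = (\zeta^{d,s}_{i:j}(u), u = x^{d+1}_{j})$ and $\hat{\zeta}^{d,s}_{i:j} = (\hat{\zeta}^{d,s}_{i:j}(u), e^{d}_{j} = 1, u = x^{d+1}_{j})$. These lead to the factorisation

$$\Phi[\hat{\zeta}^{d,s}_{i:j}] = \Phi[\hat{\zeta}^{d,s}_{i:j}(u)] \, E^{d,s}_{u,j} \tag{43}$$

where the state-ending potential $E^{d,s}_{u,j}$ is activated in the context $c = (e^{d}_{j} = 1)$. Thus, the inside mass in Equation 25 can be rewritten as

$$\begin{aligned}
\Delta^{d,s}_{i:j} &= \sum_{u \in S^{d+1}} \sum_{\zeta^{d,s}_{i:j}(u)} \Phi[\hat{\zeta}^{d,s}_{i:j}(u)] \, E^{d,s}_{u,j} \\
&= \sum_{u \in S^{d+1}} E^{d,s}_{u,j} \sum_{\zeta^{d,s}_{i:j}(u)} \Phi[\hat{\zeta}^{d,s}_{i:j}(u)] \\
&= \sum_{u \in S^{d+1}} E^{d,s}_{u,j} \, \alpha^{d,s}_{i:j}(u)
\end{aligned} \tag{44}$$

This equation holds for $d < D$. When $d = D$, we set $\Delta^{D,s}_{i:i} = 1$ for all $s \in S^{D}$ and $i \in [1, T]$, and when $d = 1$, we must ensure that $i = 1$ and $j = T$.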

Remark: Equations 40, 41, 42 and 44 specify a left-right and bottom-up algorithm to compute both the inside and asymmetric inside masses. Initially, at the bottom level, $\Delta^{D,s}_{i:i} = 1$ for $i \in [1, T]$ and $s \in S^{D}$. A pseudo-code of the dynamic programming algorithm to compute all the inside and asymmetric inside masses and the partition function is given in Figure 12.

Input: $D$, $T$, all the potential function values.
Output: partition function $Z$;
    $\Delta^{1,s}_{1:T}$, for $s \in S^1$;
    $\Delta^{d,s}_{i:j}$, for $d \in [2, D-1]$, $s \in S^d$ and $1 \le i \le j \le T$;
    $\Delta^{D,s}_{i:i}$ for $s \in S^D$ and $i \in [1, T]$;
    $\alpha^{d,s}_{i:j}(u)$ for $d \in [1, D-1]$, $u \in S^{d+1}$ and $1 \le i \le j \le T$

/* Initialisation */
$\Delta^{D,s}_{i:i} = 1$ for all $i \in [1, T]$ and $s \in S^D$
/* At the level d = D-1 */
For i = 1, 2, ..., T
    For j = i, i+1, ..., T
        Compute $\alpha^{D-1,s}_{i:j}(u)$ using Equation 42
        Compute $\Delta^{D-1,s}_{i:j}$ using Equation 44
    EndFor
EndFor

/* The main recursion loops: bottom-up and forward */
For d = D-2, D-3, ..., 1
    For i = 1, 2, ..., T
        For j = i, i+1, ..., T
            Compute $\alpha^{d,s}_{i:i}(u)$ using Equation 41 if j = i
            Compute $\alpha^{d,s}_{i:j}(u)$ using Equation 40 if j > i
            Compute $\Delta^{d,s}_{i:j}$ using Equation 44 if d > 1
        EndFor
    EndFor
EndFor

Compute $Z$ using Equation 29

Figure 12: Computing the set of inside/asymmetric inside masses and the partition function.

4.3 Computing Outside Masses 

In this subsection we show how to recursively compute the symmetric outside mass and the asymmetric outside mass. We use the same blanket decomposition as in Section 4.2. However, this time the view is reversed as we are interested in quantities outside the blankets. For example, outside the inner symmetric Markov blanket in Figure 11, there exists an outer asymmetric blanket and another sub-asymmetric blanket on the left.

4.3.1 Computing asymmetric outside mass from outside mass 

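Let us examine the variables $\underline{\zeta}^{d,s}_{i:j}(u)$ associated with the asymmetric Markov blanket $\Gamma^{d,s}_{i:j}(u)$, for $d \in [1, D-1]$ and $1 \le i \le j \le T$ (see Definition 6). For $j < T$, assume that there exists an outer asymmetric Markov blanket $\Gamma^{d,s}_{i:t}(v)$ for some $v \in S^{d+1}$ and $t \in [j+1, T]$, and a symmetric Markov blanket $\Pi^{d+1,v}_{j+1:t}$ right next to $\Gamma^{d,s}_{i:j}(u)$. Given these blankets we have the decomposition $\underline{\zeta}^{d,s}_{i:j}(u) = (\underline{\zeta}^{d,s}_{i:t}(v), \zeta^{d+1,v}_{j+1:t}, x^{d+1}_{j} = u)$, which leads to the following factorisation

$$\Phi[\hat{\underline{\zeta}}^{d,s}_{i:j}(u)] = \Phi[\hat{\underline{\zeta}}^{d,s}_{i:t}(v)] \, \Phi[\hat{\zeta}^{d+1,v}_{j+1:t}] \, R^{d+1,v}_{j+1:t} \, A^{d+1,s}_{u,v,j} \tag{45}$$

The state transition potential $A^{d+1,s}_{u,v,j}$ is enabled in the context $c = (e^{d}_{j} = 0, e^{d+1}_{j} = 1)$, and the state persistence potential $R^{d+1,v}_{j+1:t}$ in the context $c = (e^{d+1}_{j} = 1, e^{d+1}_{j+1:t-1} = 0, e^{d+1}_{t} = 1)$.

In addition, there exists a special case where the state $s$ ends at $j$. We have the decomposition $\underline{\zeta}^{d,s}_{i:j}(u) = (\underline{\zeta}^{d,s}_{i:j}, u = x^{d+1}_{j})$ and the following factorisation

$$\Phi[\hat{\underline{\zeta}}^{d,s}_{i:j}(u)] = \Phi[\hat{\underline{\zeta}}^{d,s}_{i:j}] \, R^{d,s}_{i:j} \, E^{d,s}_{u,j} \tag{46}$$

The ending potential $E^{d,s}_{u,j}$ appears here because of the context $c = (e^{d}_{j} = 1)$, i.e. $s$ ends at $j$.

Now we relax the assumption on $t$ and $v$ and allow them to take all possible values, i.e. $t \in [j, T]$ and $v \in S^{d+1}$. Thus we can replace Equation 36 by

$$\begin{aligned}
\lambda^{d,s}_{i:j}(u) &= \sum_{v \in S^{d+1}} \sum_{t \in [j+1,T]} \sum_{\underline{\zeta}^{d,s}_{i:t}(v)} \sum_{\zeta^{d+1,v}_{j+1:t}} \Phi[\hat{\underline{\zeta}}^{d,s}_{i:t}(v)] \, \Phi[\hat{\zeta}^{d+1,v}_{j+1:t}] \, R^{d+1,v}_{j+1:t} A^{d+1,s}_{u,v,j} \\
&\quad + \sum_{\underline{\zeta}^{d,s}_{i:j}} \Phi[\hat{\underline{\zeta}}^{d,s}_{i:j}] \, R^{d,s}_{i:j} E^{d,s}_{u,j} \\
&= \sum_{v \in S^{d+1}} \sum_{t \in [j+1,T]} \lambda^{d,s}_{i:t}(v) \, \hat{\Delta}^{d+1,v}_{j+1:t} A^{d+1,s}_{u,v,j} + \hat{\Lambda}^{d,s}_{i:j} E^{d,s}_{u,j}
\end{aligned} \tag{47}$$

for $d \in [2, D-2]$ and $1 \le i \le j \le T$. Thus, $\lambda^{d,s}_{i:j}(u)$ can be thought of as a message passed backward from $j = T$ to $j = i$. Here, the asymmetric outside mass ending at $j$ is computed using all the asymmetric outside masses ending at $t$ for $t \in [j+1, T]$.

There are two special cases. At the top level, i.e. $d = 1$, $\lambda^{1,s}_{i:j}(u)$ is only defined at $i = 1$, and the second term of the RHS of Equation 47 is included only if $i = 1$, $j = T$. At the second lowest level, i.e. $d = D-1$, we cannot sum over $t$ as in Equation 47 since $\hat{\Delta}^{D,v}_{j+1:t}$ is only defined for $t = j+1$. We have the following relation instead

$$\lambda^{D-1,s}_{i:j}(u) = \sum_{v \in S^{D}} \lambda^{D-1,s}_{i:j+1}(v) \, \hat{\Delta}^{D,v}_{j+1:j+1} A^{D,s}_{u,v,j} + \hat{\Lambda}^{D-1,s}_{i:j} E^{D-1,s}_{u,j} \tag{48}$$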



4.3.2 Computing outside mass from asymmetric outside mass 

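Given a symmetric Markov blanket $\Pi^{d+1,u}_{i:j}$ for $d \in [1, D-1]$, assume that there exists an asymmetric Markov blanket $\Gamma^{d,s}_{t:j}(u)$ at the parent level $d$, where $t \in [1, i]$. Clearly, for $t \in [1, i-1]$ there exists some sub-asymmetric Markov blanket $\Gamma^{d,s}_{t:i-1}(v)$. See Figure 11 for an illustration.

Let us consider two cases: $t < i$ and $t = i$.

Case 1. For $t < i$, this enables the decomposition $\underline{\zeta}^{d+1,u}_{i:j} = (\underline{\zeta}^{d,s}_{t:j}(u), \zeta^{d,s}_{t:i-1}(v), u = x^{d+1}_{i:j})$, which leads to the following factorisation

$$\Phi[\hat{\underline{\zeta}}^{d+1,u}_{i:j}] = \Phi[\hat{\underline{\zeta}}^{d,s}_{t:j}(u)] \, \Phi[\hat{\zeta}^{d,s}_{t:i-1}(v)] \, A^{d+1,s}_{v,u,i-1} \tag{49}$$

The state transition potential $A^{d+1,s}_{v,u,i-1}$ is activated in the context $c = (e^{d}_{i-1} = 0, e^{d+1}_{i-1} = 1)$.

Case 2. For $t = i$, the decomposition reduces to $\underline{\zeta}^{d+1,u}_{i:j} = (\underline{\zeta}^{d,s}_{i:j}(u), u = x^{d+1}_{i:j})$, which leads to the following factorisation

$$\Phi[\hat{\underline{\zeta}}^{d+1,u}_{i:j}] = \Phi[\hat{\underline{\zeta}}^{d,s}_{i:j}(u)] \, \pi^{d,s}_{u,i} \tag{50}$$

The state-initialisation potential $\pi^{d,s}_{u,i}$ is active in the context $c = (e^{d}_{i-1} = 1)$.

However, these decompositions and factorisations only hold given the assumption of specific values of $s \in S^{d}$, $v \in S^{d+1}$, and $t \in [1, i]$. Without further information we have to take all possibilities into account. Substituting these relations into Equation 26 we have

$$\begin{aligned}
\Lambda^{d+1,u}_{i:j} &= \sum_{s \in S^{d}} \sum_{v \in S^{d+1}} \sum_{t \in [1,i-1]} \sum_{\underline{\zeta}^{d,s}_{t:j}(u)} \sum_{\zeta^{d,s}_{t:i-1}(v)} \Phi[\hat{\underline{\zeta}}^{d,s}_{t:j}(u)] \, \Phi[\hat{\zeta}^{d,s}_{t:i-1}(v)] \, A^{d+1,s}_{v,u,i-1} \\
&\quad + \sum_{s \in S^{d}} \sum_{\underline{\zeta}^{d,s}_{i:j}(u)} \Phi[\hat{\underline{\zeta}}^{d,s}_{i:j}(u)] \, \pi^{d,s}_{u,i} \\
&= \sum_{s \in S^{d}} \sum_{t \in [1,i-1]} \lambda^{d,s}_{t:j}(u) \sum_{v \in S^{d+1}} \alpha^{d,s}_{t:i-1}(v) A^{d+1,s}_{v,u,i-1} + \sum_{s \in S^{d}} \lambda^{d,s}_{i:j}(u) \, \pi^{d,s}_{u,i}
\end{aligned} \tag{51}$$

for $d \in [2, D-2]$.

There are three special cases. The first is the base case at the top level, where $\Lambda^{1,s}_{1:T} = 1$ for all $s \in S^1$. In the second case, for $d = 1$, we must fix the index $t = 1$ since the asymmetric inside mass $\alpha^{1,s}_{t:i-1}$ is only defined at $t = 1$. Also the second term in the RHS is included only if $i = 1$ for the asymmetric outside mass $\lambda^{1,s}_{i:j}(u)$ to make sense. In the third case, for $d + 1 = D$, we only have $i = j$.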

Remark: Equations 47, 48 and 51 show a recursive top-down and outside-in approach to compute the symmetric/asymmetric outside masses. We start from the top with $d = 1$ and $\Lambda^{1,s}_{1:T} = 1$ for all $s \in S^1$ and proceed downward until $d = D$. The pseudo-code is given in Figure 13. Figure 14 summarises the quantities computed in Sections 4.2 and 4.3.

Figure 15 summarises the AIO algorithm for computing all building blocks and the partition function.

Input: $D$, $T$, all the potential function values, all inside/asymmetric inside masses.
Output: all outside/asymmetric outside masses
Initialise: $\Lambda^{1,s}_{1:T} = 1$,
    $\lambda^{1,s}_{1:T}(u) = \hat{\Lambda}^{1,s}_{1:T} E^{1,s}_{u,T}$ for $s \in S^1$, $u \in S^2$
/* the main recursive loops: top-down and inside-out */
For d = 1, 2, ..., D-1
    For i = 1, 2, ..., T
        For j = T, T-1, ..., i
            Compute the asymmetric outside mass $\lambda^{d,s}_{i:j}(u)$ using Equations 47 and 48
            Compute the outside mass $\Lambda^{d+1,u}_{i:j}$ using Equation 51
        EndFor
    EndFor
EndFor

Figure 13: Computing the set of outside/asymmetric outside masses.

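• $\Delta^{1,s}_{1:T}$, $\Lambda^{1,s}_{1:T}$ for $s \in S^1$

• $\Delta^{d,s}_{i:j}$, $\Lambda^{d,s}_{i:j}$ for $d \in [2, D-1]$, $s \in S^d$, $1 \le i \le j \le T$

• $\Delta^{D,s}_{i:i}$, $\Lambda^{D,s}_{i:i}$ for $i \in [1, T]$, $s \in S^D$

• $\alpha^{1,s}_{1:j}(u)$, $\lambda^{1,s}_{1:j}(u)$ for $d = 1$, $s \in S^1$, $u \in S^2$, $j \in [1, T]$

• $\alpha^{d,s}_{i:j}(u)$, $\lambda^{d,s}_{i:j}(u)$ for $d \in [2, D-1]$, $s \in S^d$, $u \in S^{d+1}$, $1 \le i \le j \le T$

Figure 14: Summary of basic building blocks computed in Sections 4.2 and 4.3.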



5 The Generalised Viterbi Algorithm 

By definition the MAP assignment is the maximiser of the conditional distribution given an observation sequence $z$

$$\zeta^{MAP} = \arg\max_{\zeta} \Pr(\zeta \mid z) = \arg\max_{\zeta} \Phi[\zeta, z] \tag{52}$$

For clarity, let us drop the notation $z$ and assume that it is implicitly there.

Input: $D$, $T$, all the potential function values
Output: all building blocks and partition function

Compute all inside/asymmetric inside masses using the algorithm in Figure 12
Compute all outside/asymmetric outside masses using the algorithm in Figure 13

Figure 15: The AIO algorithm.

The process of computing the MAP assignment is very similar to that of computing the partition function. This similarity comes from the relation between the sum-product and max-product algorithms (the latter being a generalisation of the Viterbi algorithm) of Pearl (1988), and from the fact that the inside/asymmetric inside procedures described in Section 4.2 are essentially a sum-product version. All we need to do is convert the summations into the corresponding maximisations. The algorithm is a two-step procedure:

• In the first step the maximum joint potential is computed and local maximum states and ending indicators are saved along the way. These states and ending indicators are maintained in a bookkeeper.

• In the second step we decode the best assignment by backtracking through saved local maximum states.

We make use of the contextual decompositions and factorisations from Section 4.2.
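
The relation between the two passes can be illustrated on a much simpler model than the HSCRF. The sketch below assumes a flat first-order chain with hypothetical unary and pairwise potentials; it is not the generalised Viterbi algorithm of this section. The same forward recursion gives the partition function when the combining operator is a sum and the maximum joint potential when it is a max, and a bookkeeper of best predecessors supports the backtracking step.

```python
from itertools import product

def chain_pass(unary, pairwise, combine):
    """Forward pass over a chain: combine=sum gives Z, combine=max the best score."""
    T, K = len(unary), len(unary[0])
    msg = list(unary[0])
    for t in range(1, T):
        msg = [combine(msg[v] * pairwise[v][u] for v in range(K)) * unary[t][u]
               for u in range(K)]
    return combine(msg)

def viterbi(unary, pairwise):
    """Step 1: max-product pass with a bookkeeper. Step 2: backtracking."""
    T, K = len(unary), len(unary[0])
    score = list(unary[0])
    book = []                                  # book[t-1][u]: best predecessor of u at t
    for t in range(1, T):
        prev = [max(range(K), key=lambda v: score[v] * pairwise[v][u])
                for u in range(K)]
        score = [score[prev[u]] * pairwise[prev[u]][u] * unary[t][u]
                 for u in range(K)]
        book.append(prev)
    last = max(range(K), key=lambda u: score[u])
    path = [last]
    for prev in reversed(book):                # decode from right to left
        path.append(prev[path[-1]])
    return score[last], path[::-1]

def joint(unary, pairwise, path):
    """Joint potential of one complete assignment."""
    w = unary[0][path[0]]
    for t in range(1, len(path)):
        w *= pairwise[path[t - 1]][path[t]] * unary[t][path[t]]
    return w

def all_paths(T, K):
    """Every assignment of a length-T chain with K states; for checking only."""
    return list(product(range(K), repeat=T))

# Hypothetical potentials for a chain of length 3 with 2 states.
unary = [[1.0, 2.0], [0.5, 1.5], [2.0, 1.0]]
pairwise = [[1.0, 0.2], [0.3, 2.0]]
Z = chain_pass(unary, pairwise, sum)
best_score, best_path = viterbi(unary, pairwise)
```

In the HSCRF the bookkeeper additionally stores time indices, because segment boundaries are unknown; the chain version only needs to store states.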



Notations 

With some abuse of notation, this section slightly modifies the notation used in the rest of the paper. See Table 2 for reference.
We now describe the first step.



5.1 Computing the Maximum Joint Potential, Maximal States and 
Time Indices 

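As $\Phi[\zeta] = \Phi[\hat{\zeta}^{1,s}_{1:T}] R^{1,s}_{1:T}$ for $s \in S^1$ we have

$$\max_{\zeta} \Phi[\zeta] = \max_{s \in S^1} R^{1,s}_{1:T} \max_{\zeta^{1,s}_{1:T}} \Phi[\hat{\zeta}^{1,s}_{1:T}] \tag{53}$$

Now, for a sub-assignment $\zeta^{d,s}_{i:j}$ for $d \in [1, D-1]$, Equation 43 leads to

$$\max_{\zeta^{d,s}_{i:j}} \Phi[\hat{\zeta}^{d,s}_{i:j}] = \max_{u \in S^{d+1}} E^{d,s}_{u,j} \max_{\zeta^{d,s}_{i:j}(u)} \Phi[\hat{\zeta}^{d,s}_{i:j}(u)] \tag{54}$$

With some slight abuse of notation we introduce $\Delta^{\max,d,s}_{i:j}$ as the optimal potential function of the subset of variables $\zeta^{d,s}_{i:j}$, and $\alpha^{\max,d,s}_{i:j}(u)$ as the optimal potential function of the subset of variables $\zeta^{d,s}_{i:j}(u)$.

Definition 7. We define $\Delta^{\max,d,s}_{i:j}$, its full version $\hat{\Delta}^{\max,d,s}_{i:j}$, and $\alpha^{\max,d,s}_{i:j}(u)$ as follows

$$\Delta^{\max,d,s}_{i:j} = \max_{\zeta^{d,s}_{i:j}} \Phi[\hat{\zeta}^{d,s}_{i:j}] \tag{55}$$

$$\hat{\Delta}^{\max,d,s}_{i:j} = \Delta^{\max,d,s}_{i:j} R^{d,s}_{i:j} \tag{56}$$

$$\alpha^{\max,d,s}_{i:j}(u) = \max_{\zeta^{d,s}_{i:j}(u)} \Phi[\hat{\zeta}^{d,s}_{i:j}(u)] \tag{57}$$

Equations 53 and 54 can be rewritten more compactly as

$$\Phi[\zeta^{MAP}] = \max_{s \in S^1} \hat{\Delta}^{\max,1,s}_{1:T} \tag{58}$$

$$\Delta^{\max,d,s}_{i:j} = \max_{u \in S^{d+1}} E^{d,s}_{u,j} \, \alpha^{\max,d,s}_{i:j}(u) \tag{59}$$

for $d \in [1, D-1]$. When $d = D$, we simply set $\Delta^{\max,D,s}_{i:i} = 1$ for all $s \in S^D$ and $i \in [1, T]$.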

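From the factorisations in Equations 38 and 39 we have

$$\max_{\zeta^{d,s}_{i:j}(u)} \Phi[\hat{\zeta}^{d,s}_{i:j}(u)] = \max\Big\{ \max_{v \in S^{d+1}} \max_{t \in [i+1,j]} \max_{\zeta^{d,s}_{i:t-1}(v)} \Phi[\hat{\zeta}^{d,s}_{i:t-1}(v)] \max_{\zeta^{d+1,u}_{t:j}} \Phi[\hat{\zeta}^{d+1,u}_{t:j}] R^{d+1,u}_{t:j} A^{d+1,s}_{v,u,t-1} ;\; \max_{\zeta^{d+1,u}_{i:j}} \Phi[\hat{\zeta}^{d+1,u}_{i:j}] R^{d+1,u}_{i:j} \pi^{d,s}_{u,i} \Big\} \tag{60}$$

and

$$\alpha^{\max,d,s}_{i:j}(u) = \max\Big\{ \max_{v \in S^{d+1}} \max_{t \in [i+1,j]} \alpha^{\max,d,s}_{i:t-1}(v) \, \hat{\Delta}^{\max,d+1,u}_{t:j} A^{d+1,s}_{v,u,t-1} ;\; \hat{\Delta}^{\max,d+1,u}_{i:j} \pi^{d,s}_{u,i} \Big\} \tag{61}$$

for $d \in [1, D-2]$ and $i < j$. For $d = D-1$, we cannot scan the index $t$ in the interval $[i+1, j]$ because the maximum inside $\Delta^{\max,D,u}_{t:j}$ is only defined at $t = j$. We have the following instead

$$\alpha^{\max,D-1,s}_{i:j}(u) = \max_{v \in S^{D}} \alpha^{\max,D-1,s}_{i:j-1}(v) \, \hat{\Delta}^{\max,D,u}_{j:j} A^{D,s}_{v,u,j-1} \tag{62}$$

There is a base case for $i = j$, where the context $c = (e^{d}_{i-1} = 1)$ is active, then

$$\alpha^{\max,d,s}_{i:i}(u) = \hat{\Delta}^{\max,d+1,u}_{i:i} \pi^{d,s}_{u,i} \tag{63}$$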

Of course, what we are really interested in is not the maximum joint potentials but the optimal states and time indices (or ending indicators). We need some bookkeepers to hold these quantities along the way. With some abuse of notation let us introduce the symmetric inside bookkeeper $\Delta^{\arg,d,s}_{i:j}$ associated with Equation 59, and the asymmetric inside bookkeeper $\alpha^{\arg,d,s}_{i:j}(u)$ associated with Equations 61, 62 and 63.

Definition 8. We define the symmetric inside bookkeeper $\Delta^{\arg,d,s}_{i:j}$ as follows

$$\Delta^{\arg,d,s}_{i:j} = u^{*} = \arg\max_{u \in S^{d+1}} E^{d,s}_{u,j} \, \alpha^{\max,d,s}_{i:j}(u) \tag{64}$$

Similarly, we define the asymmetric inside bookkeeper $\alpha^{\arg,d,s}_{i:j}(u)$ associated with Equation 61 for $d \in [1, D-2]$ as

$$\alpha^{\arg,d,s}_{i:j}(u) = (v, t)^{*} = \arg\max_{t \in [i+1,j],\, v \in S^{d+1}} \alpha^{\max,d,s}_{i:t-1}(v) \, \hat{\Delta}^{\max,d+1,u}_{t:j} A^{d+1,s}_{v,u,t-1} \tag{65}$$

if $\max_{v \in S^{d+1},\, t \in [i+1,j]} \alpha^{\max,d,s}_{i:t-1}(v) \, \hat{\Delta}^{\max,d+1,u}_{t:j} A^{d+1,s}_{v,u,t-1} > \hat{\Delta}^{\max,d+1,u}_{i:j} \pi^{d,s}_{u,i}$ and $i < j$; and

$$\alpha^{\arg,d,s}_{i:j}(u) = \text{undefined} \tag{66}$$

otherwise. For $d = D-1$, $\alpha^{\arg,D-1,s}_{i:j}(u)$ is associated with Equation 62:

$$\alpha^{\arg,D-1,s}_{i:j}(u) = \arg\max_{v \in S^{D}} \alpha^{\max,D-1,s}_{i:j-1}(v) \, \hat{\Delta}^{\max,D,u}_{j:j} A^{D,s}_{v,u,j-1} \tag{67}$$

Equations 58, 59, 61, 62 and 63 provide a recursive procedure to compute the maximum joint potential in a bottom-up and left-right manner. Initially we just set $\Delta^{\max,D,s}_{i:i} = 1$ for all $s \in S^{D}$ and $i \in [1, T]$. The procedure is summarised in Figure 16.

5.2 Decoding the MAP Assignment 

The backtracking process proceeds in the opposite direction to the max-product pass. Specifically, we start from the root and proceed in a top-down and right-left manner. The goal is to identify the right-most segment at each level. Formally, a segment is a triple $(s, i, j)$ where $s$ is the segment label, and $i$ and $j$ are the start and end time indices, respectively. From the maximum inside $\Delta^{\max,d,s}_{i:j}$ at level $d$, we identify the best child $u$ and its ending time $j$ from Equation 59. This gives rise to the maximum asymmetric inside $\alpha^{\max,d,s}_{i:j}(u)$. Then we seek the best child $v$ that transits to $u$ under the same parent $s$ using Equation 61. Since the starting time $t$ for $u$ has been identified, the ending time for $v$ is $t - 1$. We now have a right-most segment $(u, t, j)$ at level $d + 1$. The procedure is repeated until we reach the starting time $i$ of the parent $s$. The backtracking algorithm is summarised in Figure 17.

Finally, the generalised Viterbi algorithm is given in Figure 18.

Working in log-space to avoid numerical overflow 

With long sequences and complex topologies we may run into the problem of numerical overflow, i.e. the numerical value of the maximum joint potential exceeds the range of numbers the machine can represent. To avoid this, we can work in log-space instead, exploiting the monotonicity of the log function. The equations in log-space are summarised in Table 3.
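
A small numerical illustration of the issue, under assumed values: the sketch below compares two hypothetical candidate assignments whose joint potentials are products of 200 local potentials. In double precision both products overflow to infinity and cannot be ranked, while the sums of logs remain finite and preserve the ordering because the log function is monotonic.

```python
import math

def joint_potential(potentials):
    """Product of local potentials, as used by the plain max-product pass."""
    score = 1.0
    for p in potentials:
        score *= p
    return score

def log_joint_potential(potentials):
    """The same quantity in log-space: products become sums."""
    return sum(math.log(p) for p in potentials)

# Two hypothetical candidates that differ only in their last local potential.
path_a = [1e3] * 200
path_b = [1e3] * 199 + [2e3]

naive_a = joint_potential(path_a)      # exceeds the double range
naive_b = joint_potential(path_b)
log_a = log_joint_potential(path_a)
log_b = log_joint_potential(path_b)
```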

6 Parameter Estimation 

In this section, we tackle the problem of parameter estimation by maximising the (con- 
ditional) data likelihood. Typically we need some parametric form to be defined for a 
particular problem and we need some numerical method to do the optimisation task. 

Here we employ the log-linear parameterisation, which is commonly used in the CRF setting. Recall from Section 2.3 that estimating the parameters of log-linear models using gradient-based methods requires the computation of feature expectations, or expected sufficient statistics (ESS). For our HSCRFs we need to compute four types of ESS corresponding to the state-persistence, state-transition, state-initialisation and state-ending.

6.1 Log-Linear Parameterisation

In our HSCRF setting there is a feature vector $\mathbf{f}^{d}_{\sigma}(\sigma, z)$ associated with each type of contextual clique $\sigma$, in that $\phi(\sigma^{d}, z) = \exp(\mathbf{w}_{\sigma^{d}}^{\top} \mathbf{f}^{d}_{\sigma}(\sigma, z))$. Thus, the features are active only in the context in which the corresponding contextual cliques appear.

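For the state-persistence contextual clique, the features incorporate state-duration, start time $i$ and end time $j$ of the state. Other feature types incorporate the time index in which the features are triggered. Specifically,

$$R^{d,s,z}_{i:j} = \exp\big(\mathbf{w}^{d,s\,\top}_{\sigma^{persist}} \mathbf{f}^{d,s}_{\sigma^{persist}}(i, j, z)\big) \tag{68}$$

$$A^{d,s,z}_{u,v,t} = \exp\big(\mathbf{w}^{d,s\,\top}_{\sigma^{transit},u,v} \mathbf{f}^{d,s}_{\sigma^{transit},u,v}(t, z)\big) \tag{69}$$

$$\pi^{d,s,z}_{u,t} = \exp\big(\mathbf{w}^{d,s\,\top}_{\sigma^{init},u} \mathbf{f}^{d,s}_{\sigma^{init},u}(t, z)\big) \tag{70}$$

$$E^{d,s,z}_{u,t} = \exp\big(\mathbf{w}^{d,s\,\top}_{\sigma^{exit},u} \mathbf{f}^{d,s}_{\sigma^{exit},u}(t, z)\big) \tag{71}$$

Denote by $\mathbf{F}^{d}_{\sigma}(\zeta, z)$ the global feature, which is the sum of all active features $\mathbf{f}^{d}_{\sigma}(z)$ at level $d$ in the duration $[1, T]$ for a given assignment of $\zeta$ and a clique type $\sigma$. Recall that $\tau^{d} = \{i_k\}_{k=1}^{m}$ is the set of ending time indices (i.e. $e^{d}_{i_k} = 1$). The four feature types are given in Equations 72 to 75.

$$\mathbf{F}^{d,s}_{\sigma^{persist}}(\zeta, z) = \mathbf{f}^{d,s}_{\sigma^{persist}}(1, i_1, z) + \sum_{i_k \in \tau^{d},\, k \ge 1} \mathbf{f}^{d,s}_{\sigma^{persist}}(i_k + 1, i_{k+1}, z) \tag{72}$$

$$\mathbf{F}^{d,s}_{\sigma^{transit},u,v}(\zeta, z) = \sum_{i_k \notin \tau^{d-1},\, i_k \in \tau^{d}} \mathbf{f}^{d,s}_{\sigma^{transit},u,v}(i_k, z) \tag{73}$$

$$\mathbf{F}^{d,s}_{\sigma^{init},u}(\zeta, z) = \mathbf{f}^{d,s}_{\sigma^{init},u}(1, z) + \sum_{i_k \in \tau^{d}} \mathbf{f}^{d,s}_{\sigma^{init},u}(i_k + 1, z) \tag{74}$$

$$\mathbf{F}^{d,s}_{\sigma^{exit},u}(\zeta, z) = \sum_{i_k \in \tau^{d}} \mathbf{f}^{d,s}_{\sigma^{exit},u}(i_k, z) \tag{75}$$

Substituting the global features into the potentials in Equations 8 and 9 we obtain the following log-linear model:

$$\Pr(\zeta \mid z) = \frac{1}{Z(z)} \exp\Big(\sum_{c \in C} \mathbf{w}_{\sigma^{c}}^{\top} \mathbf{F}_{\sigma^{c}}(\zeta, z)\Big) \tag{76}$$

where $C = \{persist, transit, init, exit\}$.

Again, for clarity of presentation we will drop the notation $z$ but implicitly assume that it is still present in each quantity.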
6.2 ESS for State-Persistence Features 

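Recall from Section 6.1 that the feature function for the state-persistence $\mathbf{f}^{d,s}_{\sigma^{persist}}(i, j)$ is active only in the context where $\Pi^{d,s}_{i:j} \in \zeta$. Thus, Equation 72 can be rewritten as

$$\mathbf{F}^{d,s}_{\sigma^{persist}}(\zeta) = \sum_{i \in [1,T]} \sum_{j \in [i,T]} \mathbf{f}^{d,s}_{\sigma^{persist}}(i, j) \, \delta[\Pi^{d,s}_{i:j} \in \zeta] \tag{77}$$

The indicator function in the RHS ensures that the feature $\mathbf{f}^{d,s}_{\sigma^{persist}}$ is only active if there exists a symmetric Markov blanket $\Pi^{d,s}_{i:j}$ in the assignment of $\zeta$. Consider the following expectation

$$\mathbb{E}\big[\mathbf{f}^{d,s}_{\sigma^{persist}}(i, j) \, \delta[\Pi^{d,s}_{i:j} \in \zeta]\big] = \sum_{\zeta} \Pr(\zeta) \, \mathbf{f}^{d,s}_{\sigma^{persist}}(i, j) \, \delta[\Pi^{d,s}_{i:j} \in \zeta] \tag{78}$$

$$= \frac{1}{Z} \sum_{\zeta} \Phi[\zeta] \, \mathbf{f}^{d,s}_{\sigma^{persist}}(i, j) \, \delta[\Pi^{d,s}_{i:j} \in \zeta] \tag{79}$$

Using the factorisation in Equation 24 we can rewrite

$$\mathbb{E}\big[\mathbf{f}^{d,s}_{\sigma^{persist}}(i, j) \, \delta[\Pi^{d,s}_{i:j} \in \zeta]\big] = \frac{1}{Z} \sum_{\zeta} \Phi[\hat{\zeta}^{d,s}_{i:j}] \, \Phi[\hat{\underline{\zeta}}^{d,s}_{i:j}] \, R^{d,s}_{i:j} \, \mathbf{f}^{d,s}_{\sigma^{persist}}(i, j) \, \delta[\Pi^{d,s}_{i:j} \in \zeta] \tag{80}$$

Note that the elements inside the sum of the RHS are non-zero only for those assignments of $\zeta$ that respect the persistent state $s^{d}_{i:j}$ and the factorisation in Equation 24, i.e. $\zeta = (\zeta^{d,s}_{i:j}, \underline{\zeta}^{d,s}_{i:j}, \Pi^{d,s}_{i:j})$. Thus, the equation can be simplified to

$$\mathbb{E}\big[\mathbf{f}^{d,s}_{\sigma^{persist}}(i, j) \, \delta[\Pi^{d,s}_{i:j} \in \zeta]\big] = \frac{1}{Z} \sum_{\zeta^{d,s}_{i:j}} \sum_{\underline{\zeta}^{d,s}_{i:j}} \Phi[\hat{\zeta}^{d,s}_{i:j}] \, \Phi[\hat{\underline{\zeta}}^{d,s}_{i:j}] \, R^{d,s}_{i:j} \, \mathbf{f}^{d,s}_{\sigma^{persist}}(i, j) \tag{81}$$

$$= \frac{1}{Z} \Delta^{d,s}_{i:j} \Lambda^{d,s}_{i:j} R^{d,s}_{i:j} \, \mathbf{f}^{d,s}_{\sigma^{persist}}(i, j) \tag{82}$$

Using Equation 77 we obtain the ESS for the state-persistence features

$$\begin{aligned}
\mathbb{E}\big[\mathbf{F}^{d,s}_{\sigma^{persist}}(\zeta)\big] &= \sum_{i \in [1,T]} \sum_{j \in [i,T]} \mathbb{E}\big[\mathbf{f}^{d,s}_{\sigma^{persist}}(i, j) \, \delta[\Pi^{d,s}_{i:j} \in \zeta]\big] \\
&= \frac{1}{Z} \sum_{i \in [1,T]} \sum_{j \in [i,T]} \Delta^{d,s}_{i:j} \Lambda^{d,s}_{i:j} R^{d,s}_{i:j} \, \mathbf{f}^{d,s}_{\sigma^{persist}}(i, j)
\end{aligned} \tag{83}$$

There are two special cases: (1) when $d = 1$, we do not sum over $i, j$ but fix $i = 1$, $j = T$, and (2) when $d = D$ we keep $j = i$.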

6.3 ESS for Transition Features 

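Recall that in Section 6.1 we define $\mathbf{f}^{d,s}_{\sigma^{transit},u,v}(t)$ as a function that is active in the context $c^{transit} = (e^{d-1}_{t} = 0, e^{d}_{t} = 1)$, in which the child state $u^{d}$ finishes its job at time $t$ and transits to the child state $v^{d}$ under the same parent $s^{d-1}$ (that is, $s^{d-1}$ is still running). Thus Equation 73 can be rewritten as

$$\mathbf{F}^{d,s}_{\sigma^{transit},u,v}(\zeta) = \sum_{t \in [1,T-1]} \mathbf{f}^{d,s}_{\sigma^{transit},u,v}(t) \, \delta[c^{transit} \in \zeta] \tag{84}$$

We now consider the following expectation

$$\mathbb{E}\big[\mathbf{f}^{d,s}_{\sigma^{transit},u,v}(t) \, \delta[c^{transit} \in \zeta]\big] = \sum_{\zeta} \Pr(\zeta) \, \mathbf{f}^{d,s}_{\sigma^{transit},u,v}(t) \, \delta[c^{transit} \in \zeta] \tag{85}$$

$$= \frac{1}{Z} \sum_{\zeta} \Phi[\zeta] \, \mathbf{f}^{d,s}_{\sigma^{transit},u,v}(t) \, \delta[c^{transit} \in \zeta] \tag{86}$$

Assume that the parent $s$ starts at $i$. Since $e^{d}_{t} = 1$, the child $v$ must start at $t + 1$ and end some time later at $j \ge t + 1$. We have the following decomposition of the configuration $\zeta$ that respects this assumption

$$\zeta = \big(\zeta^{d-1,s}_{i:t}(u), \underline{\zeta}^{d-1,s}_{i:j}(v), \zeta^{d,v}_{t+1:j}\big) \tag{87}$$

and the following factorisation of the joint potential

$$\Phi[\zeta] = \Phi[\hat{\zeta}^{d-1,s}_{i:t}(u)] \, \Phi[\hat{\underline{\zeta}}^{d-1,s}_{i:j}(v)] \, \Phi[\hat{\zeta}^{d,v}_{t+1:j}] \, R^{d,v}_{t+1:j} \, A^{d,s}_{u,v,t} \tag{88}$$

The state persistence potential $R^{d,v}_{t+1:j}$ is enabled in the context $c = (e^{d}_{t} = 1, e^{d}_{t+1:j-1} = 0, e^{d}_{j} = 1)$ and the state transition potential $A^{d,s}_{u,v,t}$ in the context $c^{transit}$. Substituting this factorisation into the RHS of Equation 86 gives us

$$\frac{1}{Z} \sum_{i \in [1,t]} \sum_{j \in [t+1,T]} \sum_{\zeta^{d-1,s}_{i:t}(u)} \sum_{\underline{\zeta}^{d-1,s}_{i:j}(v)} \sum_{\zeta^{d,v}_{t+1:j}} \Phi[\hat{\zeta}^{d-1,s}_{i:t}(u)] \, \Phi[\hat{\underline{\zeta}}^{d-1,s}_{i:j}(v)] \, \Phi[\hat{\zeta}^{d,v}_{t+1:j}] \, R^{d,v}_{t+1:j} A^{d,s}_{u,v,t} \, \mathbf{f}^{d,s}_{\sigma^{transit},u,v}(t)$$

which can be simplified to

$$\frac{1}{Z} A^{d,s}_{u,v,t} \, \mathbf{f}^{d,s}_{\sigma^{transit},u,v}(t) \sum_{i \in [1,t]} \sum_{j \in [t+1,T]} \alpha^{d-1,s}_{i:t}(u) \, \lambda^{d-1,s}_{i:j}(v) \, \hat{\Delta}^{d,v}_{t+1:j} \tag{89}$$

Using Equations 84 and 89 we obtain the ESS for the state-transition features

$$\begin{aligned}
\mathbb{E}\big[\mathbf{F}^{d,s}_{\sigma^{transit},u,v}(\zeta)\big] &= \sum_{t \in [1,T-1]} \mathbb{E}\big[\mathbf{f}^{d,s}_{\sigma^{transit},u,v}(t) \, \delta[c^{transit} \in \zeta]\big] \\
&= \frac{1}{Z} \sum_{t \in [1,T-1]} A^{d,s}_{u,v,t} \, \mathbf{f}^{d,s}_{\sigma^{transit},u,v}(t) \sum_{i \in [1,t]} \sum_{j \in [t+1,T]} \alpha^{d-1,s}_{i:t}(u) \, \lambda^{d-1,s}_{i:j}(v) \, \hat{\Delta}^{d,v}_{t+1:j}
\end{aligned} \tag{90}$$

When $d = 2$ we must fix $i = 1$ since $\alpha^{1,s}_{1:t}(u)$ and $\lambda^{1,s}_{1:j}(v)$ are only defined at $i = 1$.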

6.4 ESS for Initialisation Features 

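Recall that in Section 6.1 we define $\mathbf{f}^{d,s}_{\sigma^{init},u}(i)$ as a function at level $d$ that is triggered at time $i$ when a parent $s$ at level $d$ initialises a child $u$ at level $d + 1$. In this event, the context $c^{init} = (e^{d}_{i-1} = 1)$ must be activated for $i > 1$. Thus, Equation 74 can be rewritten as

$$\mathbf{F}^{d,s}_{\sigma^{init},u}(\zeta) = \sum_{i \in [1,T]} \mathbf{f}^{d,s}_{\sigma^{init},u}(i) \, \delta[c^{init} \in \zeta] \tag{91}$$

Now we consider the following feature expectation

$$\begin{aligned}
\mathbb{E}\big[\mathbf{f}^{d,s}_{\sigma^{init},u}(i) \, \delta[c^{init} \in \zeta]\big] &= \sum_{\zeta} \Pr(\zeta) \, \mathbf{f}^{d,s}_{\sigma^{init},u}(i) \, \delta[c^{init} \in \zeta] \\
&= \frac{1}{Z} \sum_{\zeta} \Phi[\zeta] \, \mathbf{f}^{d,s}_{\sigma^{init},u}(i) \, \delta[c^{init} \in \zeta]
\end{aligned} \tag{92}$$

For each assignment of $\zeta$ that enables $\mathbf{f}^{d,s}_{\sigma^{init},u}(i)$, we have the following decomposition

$$\zeta = \big(\underline{\zeta}^{d,s}_{i:j}(u), \zeta^{d+1,u}_{i:j}\big) \tag{93}$$

where the context $c^{init}$ activates the emission from $s$ to $u$ and the feature function $\mathbf{f}^{d,s}_{\sigma^{init},u}(i)$. Thus the joint potential can be factorised as

$$\Phi[\zeta] = \Phi[\hat{\underline{\zeta}}^{d,s}_{i:j}(u)] \, \Phi[\hat{\zeta}^{d+1,u}_{i:j}] \, R^{d+1,u}_{i:j} \, \pi^{d,s}_{u,i} \tag{94}$$

Using this factorisation and noting that the elements within the summation in the RHS of Equation 92 are non-zero only for such assignments, we can simplify the RHS of Equation 92 to

$$\begin{aligned}
&\frac{1}{Z} \sum_{j \in [i,T]} \sum_{\underline{\zeta}^{d,s}_{i:j}(u)} \sum_{\zeta^{d+1,u}_{i:j}} \Phi[\hat{\underline{\zeta}}^{d,s}_{i:j}(u)] \, \Phi[\hat{\zeta}^{d+1,u}_{i:j}] \, R^{d+1,u}_{i:j} \pi^{d,s}_{u,i} \, \mathbf{f}^{d,s}_{\sigma^{init},u}(i) \\
&\quad = \frac{1}{Z} \sum_{j \in [i,T]} \lambda^{d,s}_{i:j}(u) \, \hat{\Delta}^{d+1,u}_{i:j} \, \pi^{d,s}_{u,i} \, \mathbf{f}^{d,s}_{\sigma^{init},u}(i)
\end{aligned} \tag{95}$$

The summation over $j \in [i, T]$ is due to the fact that we do not know this index. Using Equations 91 and 95 we obtain the ESS for the initialisation features

$$\begin{aligned}
\mathbb{E}\big[\mathbf{F}^{d,s}_{\sigma^{init},u}(\zeta)\big] &= \sum_{i \in [1,T]} \mathbb{E}\big[\mathbf{f}^{d,s}_{\sigma^{init},u}(i) \, \delta[c^{init} \in \zeta]\big] \\
&= \frac{1}{Z} \sum_{i \in [1,T]} \mathbf{f}^{d,s}_{\sigma^{init},u}(i) \, \pi^{d,s}_{u,i} \sum_{j \in [i,T]} \lambda^{d,s}_{i:j}(u) \, \hat{\Delta}^{d+1,u}_{i:j}
\end{aligned} \tag{96}$$

There are two special cases: (1) when $d = 1$, there must be no scanning of $i$ but we fix $i = 1$, since there is only a single initialisation at the beginning of the sequence, and (2) when $d = D - 1$, we fix $j = i$ since $\hat{\Delta}^{D,u}_{i:j}$ is only defined at $i = j$.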

6.5 ESS for Ending Features 

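Recall that in Section 6.1 we define $\mathbf{f}^{d,s}_{\sigma^{end},u}(j)$ as a function that is activated when a child $u$ at level $d + 1$ returns control to its parent $s$ at level $d$ and time $j$. This event also enables the context $c^{end} = (e^{d}_{j} = 1)$. Thus Equation 75 can be rewritten as

$$\mathbf{F}^{d,s}_{\sigma^{end},u}(\zeta) = \sum_{j \in [1,T]} \mathbf{f}^{d,s}_{\sigma^{end},u}(j) \, \delta[c^{end} \in \zeta] \tag{97}$$

Now we consider the following feature expectation

$$\begin{aligned}
\mathbb{E}\big[\mathbf{f}^{d,s}_{\sigma^{end},u}(j) \, \delta[c^{end} \in \zeta]\big] &= \sum_{\zeta} \Pr(\zeta) \, \mathbf{f}^{d,s}_{\sigma^{end},u}(j) \, \delta[c^{end} \in \zeta] \\
&= \frac{1}{Z} \sum_{\zeta} \Phi[\zeta] \, \mathbf{f}^{d,s}_{\sigma^{end},u}(j) \, \delta[c^{end} \in \zeta]
\end{aligned} \tag{98}$$

Assume that the state $s$ starts at $i$ and ends at $j$. For each assignment of $\zeta$ that enables $\mathbf{f}^{d,s}_{\sigma^{end},u}(j)$ and respects this assumption, we have the following decomposition

$$\zeta = \big(\underline{\zeta}^{d,s}_{i:j}, \zeta^{d,s}_{i:j}(u)\big) \tag{99}$$

This assignment has the context $c^{end}$ that activates the ending of $u$. Thus the joint potential can be factorised as

$$\Phi[\zeta] = \Phi[\hat{\underline{\zeta}}^{d,s}_{i:j}] \, \Phi[\hat{\zeta}^{d,s}_{i:j}(u)] \, R^{d,s}_{i:j} \, E^{d,s}_{u,j} \tag{100}$$

Substituting this factorisation into the summation of the RHS of Equation 98 yields

$$\sum_{i \in [1,j]} \sum_{\underline{\zeta}^{d,s}_{i:j}} \sum_{\zeta^{d,s}_{i:j}(u)} \Phi[\hat{\underline{\zeta}}^{d,s}_{i:j}] \, \Phi[\hat{\zeta}^{d,s}_{i:j}(u)] \, R^{d,s}_{i:j} E^{d,s}_{u,j} \, \mathbf{f}^{d,s}_{\sigma^{end},u}(j) = \sum_{i \in [1,j]} \hat{\Lambda}^{d,s}_{i:j} \, \alpha^{d,s}_{i:j}(u) \, E^{d,s}_{u,j} \, \mathbf{f}^{d,s}_{\sigma^{end},u}(j) \tag{101}$$

Using Equations 97 and 101 we obtain the ESS for the exiting features

$$\begin{aligned}
\mathbb{E}\big[\mathbf{F}^{d,s}_{\sigma^{end},u}(\zeta)\big] &= \sum_{j \in [1,T]} \mathbb{E}\big[\mathbf{f}^{d,s}_{\sigma^{end},u}(j) \, \delta[c^{end} \in \zeta]\big] \\
&= \frac{1}{Z} \sum_{j \in [1,T]} E^{d,s}_{u,j} \, \mathbf{f}^{d,s}_{\sigma^{end},u}(j) \sum_{i \in [1,j]} \hat{\Lambda}^{d,s}_{i:j} \, \alpha^{d,s}_{i:j}(u)
\end{aligned} \tag{102}$$

There is a special case: when $d = 1$ there must be no scanning of $i, j$ but we fix $i = 1$, $j = T$.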

7 Partially Observed Data in Learning and Inference 

So far we have assumed that training data is fully labeled, and that testing data does 
not have any labels. In this section we extend the AIO to handle the cases in which 
these assumptions do not hold. Specifically, it may happen that the training data is not 
completely labeled, possibly due to lack of labeling resources. In this case, the learning 
algorithm should be robust enough to handle missing labels. On the other hand, during inference, we may obtain high quality labels for part of the data from external sources. This requires the inference algorithm to respect those labels.

7.1 The Constrained AIO algorithm

In this section we consider the general case when $\zeta = (\vartheta, h)$, where $\vartheta$ is the set of visible labels, and $h$ the hidden set. Since our HSCRF is also an exponential model it shares the same computation required for general CRFs (Equations 6 and 7). We have to compute four quantities: the partial partition function $Z(\vartheta, z)$, the partition function $Z(z)$, the 'constrained' ESS $\mathbb{E}_{h \mid \vartheta, z}[\mathbf{F}(\vartheta, h, z)]$, and the 'free' ESS $\mathbb{E}_{\zeta \mid z}[\mathbf{F}(\zeta, z)]$. The partition function and the 'free' ESS have been computed in Sections 4 and 6, respectively. This section describes the other two quantities.
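
How the four quantities fit together can be seen on a toy exponential model small enough to enumerate. The sketch below is not the HSCRF: it assumes three binary variables with a hypothetical two-dimensional global feature vector, and computes everything by brute force rather than by the AIO. Under the standard result for exponential families, the partially observed log-likelihood is $\log Z(\vartheta) - \log Z$ and its gradient is the constrained ESS minus the free ESS.

```python
import math
from itertools import product

def features(y):
    """Hypothetical global features: number of ones, number of equal neighbours."""
    return [float(sum(y)), float(sum(a == b for a, b in zip(y, y[1:])))]

def score(w, y):
    """Unnormalised joint potential exp(w . F(y))."""
    return math.exp(sum(wi * fi for wi, fi in zip(w, features(y))))

def partition_and_ess(w, assignments):
    """Return (Z, E[F]) over the given set of assignments by enumeration."""
    Z = sum(score(w, y) for y in assignments)
    ess = [sum(score(w, y) * features(y)[k] for y in assignments) / Z
           for k in range(len(w))]
    return Z, ess

def partial_loglik_and_grad(w, visible):
    """visible maps a variable index to its observed label; the rest are hidden."""
    everything = list(product([0, 1], repeat=3))
    consistent = [y for y in everything
                  if all(y[i] == v for i, v in visible.items())]
    Z_free, ess_free = partition_and_ess(w, everything)       # Z and free ESS
    Z_con, ess_con = partition_and_ess(w, consistent)         # constrained versions
    loglik = math.log(Z_con) - math.log(Z_free)
    grad = [c - f for c, f in zip(ess_con, ess_free)]
    return loglik, grad

# Hypothetical weights; only the first variable is labelled.
w = [0.3, -0.2]
visible = {0: 1}
loglik, grad = partial_loglik_and_grad(w, visible)
```

In the HSCRF the two enumerations are replaced by the unconstrained and constrained AIO passes, but the way their outputs are combined is the same.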

Let the set of visible labels be $\vartheta = (\tilde{x}, \tilde{e})$ where $\tilde{x}$ is the visible set of state variables and $\tilde{e}$ is the visible set of ending indicators. The basic idea is that we have to modify the procedures for computing the building blocks such as $\Delta^{d,s}_{i:j}$ and $\alpha^{d,s}_{i:j}(u)$ to address the constraints imposed by the labels. For example, $\Delta^{d,s}_{i:j}$ implies that the state $s$ at level $d$ starts at $i$ and persists until terminating at $j$. Then, if any observed label contradicts this assumption (e.g. there is an $\tilde{x}^{d}_{k} \ne s$ for $k \in [i, j]$), $\Delta^{d,s}_{i:j}$ will be zero. Therefore, in general, the computation of each building block is multiplied by an identity function that enforces the consistency between these labels and the required constraints for computation of that block. As an example, we consider the computation of $\Delta^{d,s}_{i:j}$ and $\alpha^{d,s}_{i:j}(u)$.

The symmetric inside mass $\Delta^{d,s}_{i:j}$ is consistent only if all of the following conditions are satisfied:

1. If there are state labels $\tilde{x}^{d}_{k}$ at level $d$ within the interval $[i, j]$, then $\tilde{x}^{d}_{k} = s$,

2. If there is any label of ending indicator $\tilde{e}^{d}_{i-1}$, then $\tilde{e}^{d}_{i-1} = 1$,

3. If there is any label of ending indicator $\tilde{e}^{d}_{k}$ for some $k \in [i, j-1]$, then $\tilde{e}^{d}_{k} = 0$, and

4. If the ending indicator $\tilde{e}^{d}_{j}$ is labeled, then $\tilde{e}^{d}_{j} = 1$.

These conditions are captured by using the following identity function:

$$\mathbb{I}[\Delta^{d,s}_{i:j}] = \delta[\tilde{x}^{d}_{k \in [i,j]} = s] \, \delta[\tilde{e}^{d}_{i-1} = 1] \, \delta[\tilde{e}^{d}_{k \in [i:j-1]} = 0] \, \delta[\tilde{e}^{d}_{j} = 1] \tag{103}$$

When labels are observed, Equation 44 is thus replaced by

$$\Delta^{d,s}_{i:j} = \mathbb{I}[\Delta^{d,s}_{i:j}] \sum_{u \in S^{d+1}} E^{d,s}_{u,j} \, \alpha^{d,s}_{i:j}(u) \tag{104}$$

Note that we do not need to explicitly enforce the state consistency in the summation over $u$ since in the bottom-up and left-right computation, $\alpha^{d,s}_{i:j}(u)$ is already computed and contributes to the sum only if it is consistent.
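
The four conditions translate directly into a small predicate. The sketch below is one possible encoding, assuming the observed labels are stored in two dictionaries keyed by (level, time); the function name and the data layout are hypothetical and not taken from the paper. Unobserved variables are simply absent from the dictionaries and never cause a violation.

```python
def inside_mass_indicator(d, s, i, j, state_labels, end_labels):
    """Return 1 if the span (d, s, i, j) agrees with every observed label, else 0.

    state_labels[(d, k)]: observed state at level d and time k, if labelled
    end_labels[(d, k)]:   observed ending indicator (0 or 1), if labelled
    """
    # Condition 1: every observed state inside [i, j] must equal s
    for k in range(i, j + 1):
        if state_labels.get((d, k), s) != s:
            return 0
    # Condition 2: if e_{i-1} is observed, the previous segment ended at i - 1
    if end_labels.get((d, i - 1), 1) != 1:
        return 0
    # Condition 3: s must not end anywhere in [i, j - 1]
    for k in range(i, j):
        if end_labels.get((d, k), 0) != 0:
            return 0
    # Condition 4: if e_j is observed, s must end at j
    if end_labels.get((d, j), 1) != 1:
        return 0
    return 1

# Hypothetical partial labelling at level 2.
states = {(2, 3): "walk"}
ends = {(2, 2): 1, (2, 4): 0}
ok = inside_mass_indicator(2, "walk", 3, 5, states, ends)
```

Multiplying each computed mass by this value zeroes out every span that contradicts a label, so inconsistent configurations never reach the sums higher up.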

Analogously, the asymmetric inside mass $\alpha^{d,s}_{i:j}(u)$ is consistent if all of the following conditions are satisfied:

1. The first three conditions for the symmetric inside mass $\Delta^{d,s}_{i:j}$ hold,

2. If the state at level $d + 1$ at time $j$ is labeled, it must be $u$, and

3. If the ending indicator $\tilde{e}^{d+1}_{j}$ is labeled, then $\tilde{e}^{d+1}_{j} = 1$.

These conditions are captured by the identity function

$$\mathbb{I}[\alpha^{d,s}_{i:j}(u)] = \delta[\tilde{x}^{d}_{k \in [i,j]} = s] \, \delta[\tilde{x}^{d+1}_{j} = u] \, \delta[\tilde{e}^{d}_{i-1} = 1] \, \delta[\tilde{e}^{d}_{k \in [i:j-1]} = 0] \, \delta[\tilde{e}^{d+1}_{j} = 1] \tag{105}$$

Thus Equation 40 becomes

$$\alpha^{d,s}_{i:j}(u) = \mathbb{I}[\alpha^{d,s}_{i:j}(u)] \Big( \sum_{k=i+1}^{j} \sum_{v \in S^{d+1}} \alpha^{d,s}_{i:k-1}(v) \, \hat{\Delta}^{d+1,u}_{k:j} A^{d+1,s}_{v,u,k-1} + \hat{\Delta}^{d+1,u}_{i:j} \pi^{d,s}_{u,i} \Big) \tag{106}$$

Note that we do not need to explicitly enforce the state consistency in the summation over $v$ and the time consistency in the summation over $k$ since in the bottom-up computation, $\alpha^{d,s}_{i:k-1}(v)$ and $\hat{\Delta}^{d+1,u}_{k:j}$ are already computed and contribute to the sum only if they are consistent. Finally, the constrained partition function $Z(\vartheta, z)$ is computed using Equation 29 given that the inside mass is consistent with the observations.

Other building blocks, such as the symmetric outside mass $\Lambda^{d,s}_{i:j}$ and the asymmetric outside mass $\lambda^{d,s}_{i:j}(u)$, are computed in an analogous way. Since $\Delta^{d,s}_{i:j}$ and $\Lambda^{d,s}_{i:j}$ are complementary and they share $(d, s, i, j)$, the same indicator function $\mathbb{I}[\Delta^{d,s}_{i:j}]$ can be applied. Similarly, the asymmetric inside mass $\alpha^{d,s}_{i:j}(u)$ and the asymmetric outside mass $\lambda^{d,s}_{i:j}(u)$ are complementary and they share $(d, s, i, j, u)$, thus the same indicator function $\mathbb{I}[\alpha^{d,s}_{i:j}(u)]$ can be applied.

Once all constrained building blocks have been computed they can be used to calculate the constrained ESS as in Section 6 without any further modifications. The only difference is that we need to replace the partition function $Z(z)$ by the constrained version $Z(\vartheta, z)$.



7.2 The Constrained Viterbi Algorithm 

Recall that in the Generalised Viterbi Algorithm described in Section 5 we want to find
the most probable configuration $\zeta^{MAP} = \arg\max_{\zeta} \Pr(\zeta|z)$. When some variables
$\vartheta$ of $\zeta$ are labeled, it is not necessary to estimate them. The task is now to estimate the
most probable configuration of the hidden variables $h$ given the labels:

$$ h^{MAP} = \arg\max_{h} \Pr(h|\vartheta, z) = \arg\max_{h} \Pr(h, \vartheta|z) = \arg\max_{h} \Phi[h, \vartheta, z] \tag{107} $$

It turns out that the constrained MAP estimation is identical to the standard MAP except
that we have to respect the labeled variables $\vartheta$.
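
To make Equation 107 concrete, the following is a minimal brute-force sketch, not the constrained Viterbi itself. It assumes a small hypothetical potential `toy_potential` over three binary variables and a hypothetical encoding of the labels as a dictionary from variable index to observed value. Configurations that disagree with a label are skipped, and the maximisation runs over the remaining hidden variables only.

```python
from itertools import product

def constrained_map(potential, n_vars, labels, states=(0, 1)):
    """Brute-force argmax of an unnormalised potential with some variables clamped.

    labels maps a variable index to its observed value (hypothetical toy encoding).
    """
    best, best_score = None, float("-inf")
    for config in product(states, repeat=n_vars):
        # Skip configurations that disagree with any label.
        if any(config[k] != v for k, v in labels.items()):
            continue
        score = potential(config)
        if score > best_score:
            best, best_score = config, score
    return best, best_score

def toy_potential(config):
    # Rewards agreement between neighbours and favours state 1 at position 0.
    score = 2.0 if config[0] == 1 else 1.0
    for a, b in zip(config, config[1:]):
        score *= 3.0 if a == b else 1.0
    return score

unconstrained, _ = constrained_map(toy_potential, 3, labels={})
constrained, _ = constrained_map(toy_potential, 3, labels={2: 0})
```

With no labels the search returns the unconstrained MAP; clamping the last variable to 0 changes the optimum for the hidden variables as well, which is the effect the constrained Viterbi exploits without enumerating configurations.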

Since the Viterbi algorithm is just the max-product version of the AIO, the constrained
Viterbi can be modified in the same manner as in the constrained AIO (Section 7.1).
Specifically, for each auxiliary quantity such as $\Delta^{max,d,s}_{i:j}$ and $\alpha^{max,d,s}_{i:j}(u)$, we
need to maintain a set of indicator functions that ensures the consistency with labels.
Equations 103 and 104 become

$$ \mathbb{I}[\Delta^{max,d,s}_{i:j}] = \delta[x^d_{k \in [i,j]} = s]\, \delta[e^d_{i-1} = 1]\, \delta[e^d_{k \in [i:j-1]} = 0]\, \delta[e^d_j = 1] $$

$$ \Delta^{max,d,s}_{i:j} = \mathbb{I}[\Delta^{max,d,s}_{i:j}] \max_{u \in S^{d+1}} \alpha^{max,d,s}_{i:j}(u)\, E^{d,s}_{u,j} \tag{108} $$

Likewise, we have the modifications to Equation 105 and Equation 106, respectively:

$$ \mathbb{I}[\alpha^{max,d,s}_{i:j}(u)] = \delta[x^d_{k \in [i,j]} = s]\, \delta[x^{d+1}_j = u]\, \delta[e^d_{i-1} = 1]\, \delta[e^d_{k \in [i:j-1]} = 0]\, \delta[e^{d+1}_j = 1] $$

$$ \alpha^{max,d,s}_{i:j}(u) = \mathbb{I}[\alpha^{max,d,s}_{i:j}(u)] \max\Big\{ \max_{k \in [i+1,j]} \max_{v \in S^{d+1}} \alpha^{max,d,s}_{i:k-1}(v)\, \hat{\Delta}^{max,d+1,u}_{k:j}\, A^{d+1,s}_{v,u,k-1};\ \hat{\Delta}^{max,d+1,u}_{i:j}\, \pi^{d,s}_{u,i} \Big\} \tag{109} $$

Other tasks in the Viterbi algorithm, including bookkeeping and backtracking, are
identical to those described in Section 5.

7.3 Complexity Analysis 

The complexity of the constrained AIO and constrained Viterbi has an upper bound of
$O(T^3)$, attained when no labels are given. It also has a lower bound of $O(T)$, attained
when all ending indicators are known and the model reduces to the standard tree-structured
graphical model. In general, the complexity decreases as more labels are available, and we
can expect a sub-cubic time behaviour.



8 Numerical Scaling 



In previous sections, we have derived AIO-based inference and learning algorithms for
both unconstrained and constrained models. The quantities computed by these algorithms,
like the inside/outside masses, often involve summation over exponentially many
positive potentials. The potentials, when estimated from data, are often not upper-bounded,
so the magnitude of the masses increases exponentially fast in the sequence length T
and goes beyond the numerical capacity of most machines even for moderate T.
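
The overflow is easy to reproduce. The sketch below is only an illustration with hypothetical numbers: it multiplies T identical potentials of value exp(5), which is the kind of product hidden inside an inside mass, and compares the naive product with the same quantity kept in the log domain.

```python
import math

def unscaled_mass(log_potential, T):
    """Product of T identical positive potentials, computed naively in floating point."""
    mass = 1.0
    for _ in range(T):
        mass *= math.exp(log_potential)
    return mass

def log_mass(log_potential, T):
    """The same quantity kept in the log domain, where it grows only linearly in T."""
    return T * log_potential

naive = unscaled_mass(5.0, 200)   # exceeds the double-precision range
stable = log_mass(5.0, 200)       # stays an ordinary number
```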

In this section we present a scaling method to reduce this numerical overflow problem.
The idea can be traced back to Pearl's message-passing procedure (Pearl, 1988;
Yedidia et al., 2005). Our AIO algorithms can be considered as generalisations
of message-passing, in which the inside masses play the role of the inside-out messages.
In Pearl's method, we reduce the messages' magnitude by normalising them at
each step. In the context of HHMMs, with which the numerical underflow problem is
associated, a similar idea has been proposed in (Bui et al., 2004), which we adapt to
our overflow problem.

8.1 Scaling the Symmetric/Asymmetric Inside Masses

Before proceeding to algorithmic details let us revisit Equation 44. If we scale down
the asymmetric inside mass $\alpha^{d,s}_{i:j}(u)$ by a factor $\kappa_j > 1$, i.e.

$$ \alpha'^{d,s}_{i:j}(u) \leftarrow \frac{\alpha^{d,s}_{i:j}(u)}{\kappa_j} \tag{110} $$

then the symmetric inside mass $\Delta^{d,s}_{i:j}$ is also scaled down by the same factor. Similarly,
we can see from Equation 40 that

$$ \alpha^{d,s}_{i:j}(u) = \sum_{t=i+1}^{j} \sum_{v \in S^{d+1}} \alpha^{d,s}_{i:t-1}(v)\, \hat{\Delta}^{d+1,u}_{t:j}\, A^{d+1,s}_{v,u,t-1} + \hat{\Delta}^{d+1,u}_{i:j}\, \pi^{d,s}_{u,i} $$

where $\hat{\Delta}^{d+1,u}_{t:j} = \Delta^{d+1,u}_{t:j} R^{d+1,u}_{t:j}$. If $\Delta^{d+1,u}_{t:j}$ for $t \in [i, j]$ is reduced by $\kappa_j$, then $\alpha^{d,s}_{i:j}(u)$
is also reduced by the same factor. In addition, using the set of recursive relations in
Equations 40 and 44, any reduction at the bottom level of $\Delta^{D,s}_{j:j}$ will result in the
reduction of the symmetric inside mass $\Delta^{d,s}_{i:j}$ and of the asymmetric inside mass $\alpha^{d,s}_{i:j}(u)$,
for d < D, by the same factor.

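Suppose $\Delta^{D,s}_{i:i}$ for all $i \in [1, j]$ is reduced by a factor of $\kappa_i > 1$; then the quantities $\Delta^{d,s}_{1:j}$
and $\alpha^{d,s}_{1:j}(u)$ will be reduced by a factor of $\prod_{i=1}^{j} \kappa_i$. That is

$$ \Delta'^{d,s}_{1:j} \leftarrow \frac{\Delta^{d,s}_{1:j}}{\prod_{i=1}^{j} \kappa_i} \tag{111} $$

$$ \alpha'^{d,s}_{1:j}(u) \leftarrow \frac{\alpha^{d,s}_{1:j}(u)}{\prod_{i=1}^{j} \kappa_i} \tag{112} $$

It follows immediately from Equation 29 that the partition function is scaled down by
a factor of $\prod_{j=1}^{T} \kappa_j$:

$$ Z' = \sum_{s \in S^1} \hat{\Delta}'^{1,s}_{1:T} = \frac{\sum_{s \in S^1} \hat{\Delta}^{1,s}_{1:T}}{\prod_{j=1}^{T} \kappa_j} \tag{113} $$

where $\hat{\Delta}^{1,s}_{1:T} = \Delta^{1,s}_{1:T} R^{1,s}_{1:T}$. Clearly, we should deal with the log of this quantity to
avoid numerical overflow. Thus, the log-partition function can be computed as

$$ \log(Z) = \log \sum_{s \in S^1} \hat{\Delta}'^{1,s}_{1:T} + \sum_{j=1}^{T} \log \kappa_j \tag{114} $$

where $\hat{\Delta}'^{1,s}_{1:T}$ has been scaled appropriately.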
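
The identity in Equation 114 can be checked on a much simpler model. The sketch below is an illustration only: it uses a flat chain with hypothetical node and transition potentials (`node_pot`, `trans_pot`) as a stand-in for the HSCRF, and takes each scaling factor to be the sum of the forward values at that step, which is one possible choice rather than necessarily the one used in the paper. The log-partition function is then the sum of the logs of the scaling factors, because the scaled forward vector sums to one.

```python
import math
from itertools import product

def log_partition_scaled(node_pot, trans_pot):
    """log Z of a chain with positive potentials, via a scaled forward pass.

    node_pot[t][s] and trans_pot[s_prev][s] are hypothetical toy potentials.
    """
    n_states = len(node_pot[0])
    alpha = list(node_pot[0])
    log_z = 0.0
    for t in range(len(node_pot)):
        if t > 0:
            alpha = [
                node_pot[t][s] * sum(alpha[r] * trans_pot[r][s] for r in range(n_states))
                for s in range(n_states)
            ]
        kappa = sum(alpha)                    # scaling factor for step t
        alpha = [a / kappa for a in alpha]    # scaled forward now sums to one
        log_z += math.log(kappa)              # accumulate the sum of log scaling factors
    return log_z                              # log of the final scaled sum is log 1 = 0

def log_partition_brute(node_pot, trans_pot):
    """Reference value obtained by enumerating every state path (small inputs only)."""
    n_states, T = len(node_pot[0]), len(node_pot)
    total = 0.0
    for path in product(range(n_states), repeat=T):
        w = node_pot[0][path[0]]
        for t in range(1, T):
            w *= trans_pot[path[t - 1]][path[t]] * node_pot[t][path[t]]
        total += w
    return math.log(total)
```

On short chains the two functions agree, while the scaled version remains finite for sequences on which the unscaled forward values would overflow.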

One question is how to choose the set of meaningful scaling factors $\{\kappa_j\}_{j=1}^{T}$. The
simplest way is to choose a relatively large number for all scaling factors but making
the right choice is not straightforward. Here we describe a more natural way to do
so. Assume that we have chosen all the scaling factors $\{\kappa_i\}_{i=1}^{j-1}$. Using the original
Equations 40, 41 and 42, where all the sub-components have been scaled appropriately,
we compute the partially-scaled inside mass $\Delta''^{d,s}_{i:j}$ for $d \in [2, D]$ and asymmetric
inside mass $\alpha''^{d,s}_{i:j}(u)$, for $d \in [1, D-1]$ and $i \in [1, j]$. Then the scaling factor at time
j is computed as

$$ \kappa_j = \sum_{s,u} \alpha''^{1,s}_{1:j}(u) \tag{115} $$

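The next step is to rescale all the partially-scaled variables:

$$ \alpha'^{d,s}_{i:j}(u) \leftarrow \frac{\alpha''^{d,s}_{i:j}(u)}{\kappa_j} \quad \text{for } s \in S^d,\ d \in [1, D-1] \tag{116} $$

$$ \Delta'^{d,s}_{i:j} \leftarrow \frac{\Delta''^{d,s}_{i:j}}{\kappa_j} \quad \text{for } s \in S^d,\ d \in [2, D-1] \tag{117} $$

$$ \Delta'^{D,s}_{j:j} \leftarrow \frac{\Delta''^{D,s}_{j:j}}{\kappa_j} \quad \text{for } s \in S^D \tag{118} $$

where $i \in [1, j]$.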



8.2 Scaling the Symmetric/Asymmetric Outside Masses 

In a similar fashion we can work out the set of factors from the derivation of symmet- 
ric/asymmetric outside masses since these masses solely depend on the inside masses 
as building blocks. In other words, after we finish scaling the inside masses we can 
compute the scaled outside masses directly, using the same set of equations described 
in Section 03] 

The algorithm is summarised in Figure 19. Note that the order of performing the
loops in this case is different from that in Figure 12.



9 Applications 

9.1 Recognising Indoor Activities 

In this experiment, we evaluate the HSCRFs with a relatively small dataset from the
domain of indoor video surveillance. The task is to recognise indoor trajectories and
activities of a person from his noisy positions extracted from video. The data, which
was captured in (Nguyen et al., 2005) and subsequently used to evaluate DCRFs in
(Truyen et al., 2006), has 90 sequences, each of which corresponds to one of 3
persistent activities: (1) preparing short-meal, (2) having snack and (3) preparing
normal-meal. The persistent activities share some of the 12 sub-trajectories. Each
sub-trajectory is a sub-sequence of discrete positions. Thus naturally, the data has a
state hierarchy of depth 3: the dummy root for each position sequence, the persistent
activities, and the sub-trajectories. The input observations to the model are simply
sequences of discrete positions.

We split the data into two sets of equal size for training and testing, respectively.
For learning, labels for each sequence are provided fully for the case of fully observed 
state data, and partially for the case of missing state data. For testing, no labels are 
given to the decoder, and decoded labels obtained from the max-product algorithm are 
compared against the ground-truth. 

In designing features, we assume that state features (i.e. between nodes) such as
initialisation, transition and exiting are indicator functions. For the data-associations
(i.e. embedded in state-persistence potentials) at the bottom level, we use the same
features as in (Truyen et al., 2006). At the second level, we use average velocities and
a vector of positions visited in the state duration. To encode the duration into the
state-persistence potentials, we employ the sufficient statistics of the gamma distribution as
features $f_k(s, \Delta t) = \mathbb{I}(s) \log(\Delta t)$ and $f_{k+1}(s, \Delta t) = \mathbb{I}(s)\, \Delta t$.
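
As an illustration of these two duration features, the sketch below computes them for a segment and combines them with a pair of weights. The state names and the weight values are hypothetical. With weights (w1, w2) the resulting potential is proportional to $\Delta t^{w_1} e^{w_2 \Delta t}$, which for $w_1 > 0$ and $w_2 < 0$ has the shape of a gamma density peaking at $\Delta t = -w_1 / w_2$.

```python
import math

def duration_features(state, target_state, duration):
    """Return (I(s) log dt, I(s) dt) for a segment of `state` lasting `duration` steps."""
    indicator = 1.0 if state == target_state else 0.0
    return indicator * math.log(duration), indicator * duration

def duration_log_potential(weights, state, target_state, duration):
    """Weighted sum of the two duration features, i.e. the log of the duration potential."""
    f_log, f_lin = duration_features(state, target_state, duration)
    return weights[0] * f_log + weights[1] * f_lin

# Hypothetical weights: the preferred duration is -w1 / w2 = 4 time steps.
weights = (2.0, -0.5)
preferred = max(range(1, 20),
                key=lambda dt: duration_log_potential(weights, "walk", "walk", dt))
```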

At each level d and time t we count an error if the predicted state is not the same
as the ground-truth. Firstly, we examine the fully observed case where the HSCRF
is compared against the DCRF at both data levels, and against the flat-CRF at the bottom
level. Table 4 (the left half) shows that (a) both the multilevel models significantly
outperform the flat model and (b) the HSCRF outperforms the DCRF.

We also test the ability of the model to learn the hierarchical topology and state
transitions. We find that it is very informative to examine the parameters which correspond
to the state transition features. Typically, negative entries in the transition parameter
matrix mean that the transition is improbable. This is because state features are
non-negative, so negative parameters mean the probabilities of these transitions are
very small (due to the exponential), compared to the positive ones. For the transition
at the second level (the complex activity level), we obtain all negative entries. This
clearly matches the training data, in which each sequence already belongs to one of three
complex activities. With this method, we are able to construct the correct hierarchical
topology as in Figure 20. The state transition model is presented in Figure 21. There
is only one wrong transition, from state 12 to state 10, which is not present in the
training data. The rest is correct.

Next we consider partially-supervised learning in that about 50% of start/end times 
of a segment and segment labels are observed at the second level. All ending indicators 
are known at the bottom level. The results are reported in Table [4] (the right half). 
As can be seen, although only 50% of the state labels and state start/end times are 
observed, the model learned is still performing well with accuracy of 80.2% and 90.4% 
at levels 2 and 3, respectively. 

We now consider the issue of using partially observed labels during decoding to improve
the prediction accuracy of poorly estimated models. We extract the parameters from
the 10th iteration of the fully observed data case. The labels are provided at random
time indexes. Figure 22 shows the decoding accuracy as a function of available state
labels. It is interesting to observe that a moderate amount of observed labels (e.g.
20-40%) causes the accuracy rate to go up considerably.

9.2 POS Tagging and Noun-Phrase Chunking 

In this experiment we apply the HSCRF to the task of noun-phrase chunking. The
data is from the CoNLL-2000 shared task (Sang and Buchholz, 2000), in which 8926
English sentences from the Wall Street Journal corpus are used for training and 2012
sentences are for testing. Each word in a pre-processed sentence is labeled by two
labels: the part-of-speech (POS) and the noun-phrase (NP). There are 48 different
POS labels and 3 NP labels (B-NP for beginning of a noun-phrase, I-NP for inside a
noun-phrase or O for others). Each noun-phrase generally has more than one word.
To reduce the computational burden, we reduce the POS tag-set to 5 groups: noun,
verb, adjective, adverb and others. Since in our HSCRFs we do not have to explicitly
indicate which node is at the beginning of a segment, the NP label set can be reduced
further into NP for noun-phrase, and O for anything else.

The POS tags are actually the output of Brill's tagger (Brill, 1995), while the
NPs are manually labeled. We extract raw features from the text in a way similar to
that in (Sutton et al., 2007). However, we consider only a limited vocabulary extracted
from the training data in that we only select words with more than 3 occurrences. This
reduces the vocabulary and the feature size significantly. We also make use of bi-grams
with similar selection criteria. Furthermore, we use a contextual window of 5 instead
of 7 as in (Sutton et al., 2007). This setting gives rise to about 32K raw features. The
model feature is factorised as $f(x_c, z) = \mathbb{I}(x_c)\, g_c(z)$, where $\mathbb{I}(x_c)$ is a binary function
on the assignment of the clique variables $x_c$, and $g_c(z)$ are the raw features.
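
The factorisation can be read as keeping one copy of the raw feature vector per clique assignment and switching on only the copy that matches the current assignment. The sketch below illustrates this reading; the dictionary representation and the feature names are hypothetical and are not taken from the experimental code.

```python
def model_features(clique_assignment, raw_features, assignments):
    """Expand raw features g_c(z) into model features I(x_c) g_c(z).

    One copy of the raw feature vector is kept per possible clique assignment;
    only the copy matching `clique_assignment` is non-zero.
    """
    features = {}
    for assignment in assignments:
        active = 1.0 if assignment == clique_assignment else 0.0
        for name, value in raw_features.items():
            features[(assignment, name)] = active * value
    return features

# Hypothetical raw features extracted from the observation z around one word.
raw = {"word=the": 1.0, "bigram=the_cat": 1.0}
expanded = model_features("NP", raw, assignments=["NP", "O"])
```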

We build an HSCRF topology of 3 levels where the root is just a dummy node, the
second level has 2 NP states and the bottom level has 5 POS states. For comparison,
we implement a DCRF, a simple sequential CRF (SCRF), and a semi-Markov CRF
(SemiCRF) (Sarawagi and Cohen, 2004). The DCRF has a grid structure of depth 2, one
chain for modelling the NP process and another for the POS process. Since the state spaces
are relatively small, we are able to run exact inference in the DCRF by collapsing both
the NP and POS state spaces to a combined state space of size 3 × 5 = 15. The SCRF
and SemiCRF model only the NP process, taking the POS tags as input.
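
Collapsing the two chains amounts to indexing every (NP label, POS group) pair by a single integer, so that the two-chain model can be treated as one chain over 15 joint states. The sketch below shows one possible indexing; the label strings follow the text, while the ordering and the function names are hypothetical.

```python
NP_LABELS = ("B-NP", "I-NP", "O")
POS_GROUPS = ("noun", "verb", "adjective", "adverb", "other")

def to_joint(np_label, pos_group):
    """Map an (NP, POS) pair to a single index in the combined state space."""
    return NP_LABELS.index(np_label) * len(POS_GROUPS) + POS_GROUPS.index(pos_group)

def from_joint(index):
    """Recover the (NP, POS) pair from a combined state index."""
    np_idx, pos_idx = divmod(index, len(POS_GROUPS))
    return NP_LABELS[np_idx], POS_GROUPS[pos_idx]

n_joint_states = len(NP_LABELS) * len(POS_GROUPS)
```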

The raw feature set used in the DCRF is identical to that in our HSCRF. However,
the set shared by the SCRF and the SemiCRF is a little more elaborate since it takes
the POS tags into account (Sutton et al., 2007).

Although both the HSCRF and the SemiCRF are capable of modelling arbitrary
segment durations, we use a simple exponential distribution as it can be processed
sequentially and thus is very efficient. For learning, we use a simple online stochastic
gradient ascent method since it has been shown to work relatively well and fast in CRFs
(Vishwanathan et al., 2006). At test time, as the SCRF and the SemiCRF are able to
use Brill's POS tags as input, it is not fair for the DCRF and HSCRF to predict
those labels during inference. Instead, we also give the POS tags to the DCRF and
HSCRF and perform constrained inference to predict only the NP labels. This boosts
the performance of the two multi-level models significantly.

The performance of these models is depicted in Figure 23. We are interested only in
the prediction of the noun-phrases since this data has Brill's POS tags. Without
Brill's POS tags given at test time, both the HSCRF and the DCRF perform worse than
the SCRF. This is not surprising because Brill's POS tags are always given in the
case of the SCRF. However, with POS tags the HSCRF consistently works better than all
other models. The DCRF does worse than the SCRF, even with POS tags given. This
differs from the observation made in (Sutton et al., 2007). However, we use a much
smaller POS tag set than (Sutton et al., 2007) does. Our explanation is that the SCRF is
able to make use of a wider context of the given POS tags (here, within the window of 5
tags) than the DCRF (limited to 1 POS tag per NP chunk). The SemiCRF, although in
theory more expressive than the SCRF, does not show any advantage under the current
setting. Since the SemiCRF is a special case of the HSCRF in which the POS level is
not modelled, this suggests that joint modelling of the NP and POS levels is
important.



10 Conclusions 

In this paper, we have presented a novel model called Hierarchical Semi-Markov
Conditional Random Field which extends the standard CRFs to incorporate hierarchical
and multilevel semantics. We have developed a graphical model-like dynamic
representation of the HSCRF. This appears similar to the DBN representation of the
HHMMs in (Murphy and Paskin, 2002), and somewhat resembles a dynamic factor graph
(Kschischang et al., 2001). However, it is not exactly the standard graphical model
because the contextual cliques in HSCRFs are not fixed during inference.

We have derived efficient algorithms for learning and inference, especially the ability
to learn and infer with partially given labels. We have demonstrated the capacity
of the HSCRFs on home video surveillance data and the shallow parsing of English
text, in which the hierarchical information inherent in the context helps to improve
recognition.

In future work we plan to attack the computational bottleneck in large-scale set- 
tings. Although the AIO family has cubic time complexity, it is still expensive in 
large-scale application, especially those with long sequences. It is therefore desirable 
to introduce approximation methods that can provide speed/quality trade-offs. 

We also need to make a choice between pre-computing all the potentials prior
to inference and learning, and computing them on-the-fly. The first choice requires
$O(D|S|^3 T^2)$ space, which is very significant with typical real-world problems, even
with today's computing power. The second choice, however, will slow the inference
and learning very significantly due to repeated computation at every step of the AIO
algorithm.

Perhaps one of the most interesting questions is how well the HSCRFs can
approximate general multilevel processes, which are not necessarily recursive. For
example, it is interesting to see if any data which is naturally represented as a DCRF
can be approximately represented by an HSCRF. This is important because HSCRFs
are tractable while DCRFs are generally not. Some data is intrinsically sequential in
the sense that there is no real 'exiting' point. The HSCRFs force some transitions at
the edge of segments to be broken, so the best HSCRFs can do is to model quite long
segments.

A Proofs

A.1 Proof of Propositions 1 and 2

Before proving Propositions 1 and 2 let us introduce a lemma.
Lemma 1. Given a distribution of the form

$$ \Pr(x) = \frac{1}{Z} \Phi[x] \tag{119} $$

where $x = (x_a, x_s, x_b)$, if there exists a factorisation

$$ \Phi[x] = \Phi[x_a, x_s]\, \Phi[x_s]\, \Phi[x_s, x_b] \tag{120} $$

then $x_a$ and $x_b$ are conditionally independent given $x_s$.

Proof: We want to prove that

$$ \Pr(x_a, x_b | x_s) = \Pr(x_a | x_s) \Pr(x_b | x_s) \tag{121} $$

Since $\Pr(x_a, x_b | x_s) = \Pr(x_a, x_b, x_s) / \sum_{x_a, x_b} \Pr(x_a, x_b, x_s)$, the LHS of
Equation 121 becomes

$$ \Pr(x_a, x_b | x_s) = \frac{\Phi[x_a, x_s]\, \Phi[x_s]\, \Phi[x_s, x_b]}{\sum_{x_a, x_b} \Phi[x_a, x_s]\, \Phi[x_s]\, \Phi[x_s, x_b]} = \frac{\Phi[x_a, x_s]}{\sum_{x_a} \Phi[x_a, x_s]} \cdot \frac{\Phi[x_s, x_b]}{\sum_{x_b} \Phi[x_s, x_b]} \tag{122} $$

where we have used the following fact

$$ \sum_{x_a, x_b} \Phi[x_a, x_s]\, \Phi[x_s]\, \Phi[x_s, x_b] = \Phi[x_s] \Big( \sum_{x_a} \Phi[x_a, x_s] \Big) \Big( \sum_{x_b} \Phi[x_s, x_b] \Big) \tag{123} $$

and cancelled out the normalisation factor Z and $\Phi[x_s]$.

To prove $\Pr(x_a | x_s) = \Phi[x_a, x_s] / \sum_{x_a} \Phi[x_a, x_s]$, we need only to show $\Pr(x_a | x_s) \propto \Phi[x_a, x_s]$
since the normalisation over $x_a$ is due to $\sum_{x_a} \Pr(x_a | x_s) = 1$. Using the
Bayes rule, we have

$$ \Pr(x_a | x_s) \propto \Pr(x_a, x_s) = \sum_{x_b} \Pr(x_a, x_s, x_b) = \frac{1}{Z} \Phi[x_a, x_s]\, \Phi[x_s] \sum_{x_b} \Phi[x_s, x_b] \propto \Phi[x_a, x_s] \tag{124} $$

where we have ignored all the factors that do not depend on $x_a$.

A similar proof gives $\Pr(x_b | x_s) = \Phi[x_s, x_b] / \sum_{x_b} \Phi[x_s, x_b]$. Combining this result
and Equation 124 with Equation 122 gives us Equation 121. This completes the
proof ■
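
Lemma 1 can also be checked numerically on small discrete variables. The sketch below is a sanity check under stated assumptions, not part of the proof: the three factors are hypothetical tables of positive numbers, and the function verifies Equation 121 for every value of $x_s$ by direct enumeration. The normalisation constant Z cancels in every conditional, so it is never computed.

```python
from itertools import product

def check_conditional_independence(phi_as, phi_s, phi_sb, n_a, n_s, n_b, tol=1e-12):
    """Check Pr(x_a, x_b | x_s) = Pr(x_a | x_s) Pr(x_b | x_s) for a factorised potential."""
    joint = {
        (a, s, b): phi_as[a][s] * phi_s[s] * phi_sb[s][b]
        for a, s, b in product(range(n_a), range(n_s), range(n_b))
    }
    for s in range(n_s):
        # Normaliser of the conditional distribution given x_s = s.
        norm = sum(joint[a, s, b] for a in range(n_a) for b in range(n_b))
        for a, b in product(range(n_a), range(n_b)):
            p_ab = joint[a, s, b] / norm
            p_a = sum(joint[a, s, bb] for bb in range(n_b)) / norm
            p_b = sum(joint[aa, s, b] for aa in range(n_a)) / norm
            if abs(p_ab - p_a * p_b) > tol:
                return False
    return True

# Hypothetical positive factors: x_a has 3 values, x_s and x_b have 2 each.
phi_as = [[1.0, 2.0], [0.5, 3.0], [4.0, 0.2]]
phi_s = [1.5, 0.7]
phi_sb = [[2.0, 1.0], [0.3, 5.0]]
holds = check_conditional_independence(phi_as, phi_s, phi_sb, 3, 2, 2)
```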

In fact, $x_s$ acts as a separator between $x_a$ and $x_b$. In standard Markov networks
there are no paths from $x_a$ to $x_b$ that do not go through $x_s$. Now we proceed to proving
Propositions 1 and 2.

Given the symmetric Markov blanket $\Pi^{d,s}_{i:j}$, there are no potentials that are associated
with variables belonging to both $\zeta^{d,s}_{i:j}$ and $\underline{\zeta}^{d,s}_{i:j}$. The blanket completely separates
$\zeta^{d,s}_{i:j}$ and $\underline{\zeta}^{d,s}_{i:j}$. Therefore, Lemma 1 ensures the conditional independence between
$\zeta^{d,s}_{i:j}$ and $\underline{\zeta}^{d,s}_{i:j}$.

Similarly, the asymmetric Markov blanket $\Gamma^{d,s}_{i:j}(u)$ separates $\zeta^{d,s}_{i:j}(u)$ and $\underline{\zeta}^{d,s}_{i:j}(u)$
and thus these two variable sets are conditionally independent due to Lemma 1 ■

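A.2 Proof of Proposition 3

Here we want to derive Equations 32, 33 and 34. With the same conditions as in
Lemma 1, in Equation 124 we have shown that $\Pr(x_a | x_s) \propto \Phi[x_a, x_s]$. Similarly, this
extends to

$$ \Pr(\zeta^{d,s}_{i:j} | \Pi^{d,s}_{i:j}) \propto \Phi[\zeta^{d,s}_{i:j}, \Pi^{d,s}_{i:j}] = \Phi[\hat{\zeta}^{d,s}_{i:j}] \tag{125} $$

which is equivalent to

$$ \Pr(\zeta^{d,s}_{i:j} | \Pi^{d,s}_{i:j}) = \frac{\Phi[\hat{\zeta}^{d,s}_{i:j}]}{\sum_{\zeta^{d,s}_{i:j}} \Phi[\hat{\zeta}^{d,s}_{i:j}]} = \frac{1}{\Delta^{d,s}_{i:j}} \Phi[\hat{\zeta}^{d,s}_{i:j}] \tag{126} $$

The last equation follows from the definition of the symmetric inside mass.
A similar procedure will yield Equation 33.

To prove Equation 34, notice that Equation 21 says

$$ \Pr(\zeta) = \Pr(\Pi^{d,s}_{i:j}) \Pr(\zeta^{d,s}_{i:j} | \Pi^{d,s}_{i:j}) \Pr(\underline{\zeta}^{d,s}_{i:j} | \Pi^{d,s}_{i:j}) \tag{127} $$

or equivalently

$$ \Pr(\Pi^{d,s}_{i:j}) = \frac{\Pr(\zeta)}{\Pr(\zeta^{d,s}_{i:j} | \Pi^{d,s}_{i:j}) \Pr(\underline{\zeta}^{d,s}_{i:j} | \Pi^{d,s}_{i:j})} \tag{128} $$

$$ = \frac{1}{Z} \Phi[\zeta] \frac{\Delta^{d,s}_{i:j} \Lambda^{d,s}_{i:j}}{\Phi[\hat{\zeta}^{d,s}_{i:j}]\, \Phi[\hat{\underline{\zeta}}^{d,s}_{i:j}]} \tag{129} $$

$$ = \frac{1}{Z} \Phi[\hat{\zeta}^{d,s}_{i:j}]\, R^{d,s}_{i:j}\, \Phi[\hat{\underline{\zeta}}^{d,s}_{i:j}] \frac{\Delta^{d,s}_{i:j} \Lambda^{d,s}_{i:j}}{\Phi[\hat{\zeta}^{d,s}_{i:j}]\, \Phi[\hat{\underline{\zeta}}^{d,s}_{i:j}]} \tag{130} $$

$$ = \frac{1}{Z} \Delta^{d,s}_{i:j}\, R^{d,s}_{i:j}\, \Lambda^{d,s}_{i:j} \tag{131} $$

In the derivation, we have made use of the relation in Equation 24. This completes
the proof ■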

B Computing the State Marginals of HSCRF 

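We are interested in computing the marginals of state variables $\Pr(x^d_t)$. We have

$$ \Pr(x^d_t) = \sum_{\zeta \setminus x^d_t} \Pr(x^d_t, \zeta \setminus x^d_t) = \sum_{\zeta} \Pr(\zeta)\, \delta(x^d_t \in \zeta) = \frac{1}{Z} \sum_{\zeta} \Phi[\zeta]\, \delta(x^d_t \in \zeta) \tag{132} $$

Let $s = x^d_t$ and assume that the state s starts at i and ends at j, and $t \in [i, j]$.
For each configuration $\zeta$ that respects this assumption, we have the factorisation of
Equation 24 that says

$$ \Phi[\zeta] = \Phi[\hat{\zeta}^{d,s}_{i:j}]\, \Phi[\hat{\underline{\zeta}}^{d,s}_{i:j}]\, R^{d,s}_{i:j} \tag{133} $$

Then Equation 132 becomes

$$ \Pr(x^d_t = s) = \frac{1}{Z} \sum_{\zeta} \Phi[\hat{\zeta}^{d,s}_{i:j}]\, \Phi[\hat{\underline{\zeta}}^{d,s}_{i:j}]\, R^{d,s}_{i:j}\, \delta(t \in [i, j]) = \frac{1}{Z} \sum_{i \in [1,t]} \sum_{j \in [t,T]} \Delta^{d,s}_{i:j}\, \Lambda^{d,s}_{i:j}\, R^{d,s}_{i:j} \tag{134} $$

The summing over i and j is due to the fact that we do not know these indices.

There are two special cases: (1) when d = 1 we cannot scan the left and right
indices, and the marginals are simply

$$ \Pr(x^1_t = s) = \frac{1}{Z} \hat{\Delta}^{1,s}_{1:T} \tag{135} $$

since $\Lambda^{1,s}_{1:T} = 1$ for all $s \in S^1$; and (2) when d = D, the start and end times must be
the same (i = j), thus

$$ \Pr(x^D_t = s) = \frac{1}{Z} \hat{\Lambda}^{D,s}_{t:t} \tag{136} $$

since $\Delta^{D,s}_{t:t} = 1$ for all $t \in [1, T]$ and $s \in S^D$.

Since $\sum_{s \in S^d} \Pr(x^d_t = s) = 1$, it follows from Equation 134 that

$$ Z = \sum_{s \in S^d} \sum_{i \in [1,t]} \sum_{j \in [t,T]} \Delta^{d,s}_{i:j}\, \Lambda^{d,s}_{i:j}\, R^{d,s}_{i:j} \tag{137} $$

This turns out to be the most general way of computing the partition function. Some
special cases have been shown earlier. For example, when d = 1, i = 1 and j = T,
Equation 137 becomes Equation 29 since $\Lambda^{1,s}_{1:T} = 1$. Similarly, when d = D, i = j = t,
Equation 137 recovers Equation 30 since $\Delta^{D,s}_{t:t} = 1$.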



C Semi-Markov CRFs as Special Case of HSCRFs 

In this Appendix we first describe the semi-Markov CRF (SemiCRF) (Sarawagi and Cohen,
2004) in our HSCRF framework and show how to convert a SemiCRF into an HSCRF.
Then in the light of HSCRF inference we show how to modify the original SemiCRF
to handle (a) partial supervision and constrained inference, and (b) numerical
scaling to avoid overflow. The modifications are of interest in their own right.



C.1 SemiCRF as an HSCRF

SemiCRF is an interesting flat segmental undirected model that generalises the chain
CRF. In the SemiCRF framework the Markov process operates at the segment level,
where a segment is a non-Markovian chain of nodes. A chain of segments is a Markov
chain. However, since each segment can potentially have arbitrary length, inference in
SemiCRFs is more involved than in chain CRFs.

Represented in our HSCRF framework (Figure 24), each node $x_t$ of the SemiCRF
is associated with an ending indicator $e_t$, with the following contextual cliques

• Segmental state, which corresponds to a single segment $s_{i:j}$ and is essentially the
state persistence contextual clique in the context $c = (e_{i-1:j} = (1, 0, \ldots, 0, 1))$ in
the HSCRF's terminology.

• State transition, which is similar to the state transition contextual clique in the
HSCRFs, corresponding to the context $c = (e_t = 1)$.

Associated with the segmental state clique is the potential $R^s_{i:j}$, and with the state
transition is the potential $A_{s',s,t}$, where $s, s' \in S$, and $S = \{1, 2, \ldots, |S|\}$.

A SemiCRF is a three-level HSCRF, where the root and bottom are dummy states.
This gives a simplified way to compute the partition function, ESS, and the MAP
assignment using the AIO algorithms. Thus, techniques developed in this paper for
numerical scaling and partially observed data can be applied to the SemiCRF. To be more
consistent with the literature of flat models such as HMMs and CRFs, we call the
asymmetric inside/outside masses the forward/backward, respectively. Since the model
is flat, we do not need the inside and outside variables.

Forward

With some abuse of notation, let $\zeta^s_{1:j} = (x_{1:j-1}, e_{1:j-1}, x_j = s, e_j = 1)$. In other
words, there is a segment of state s ending at j. We write the forward $\alpha_j(s)$ as

$$ \alpha_j(s) = \sum_{\zeta^s_{1:j}} \Phi[\zeta^s_{1:j}, z] \tag{138} $$

As a result the partition function can be written in terms of the forward as

$$ Z(z) = \sum_{\zeta_{1:T}} \Phi[\zeta_{1:T}, z] = \sum_{s} \sum_{\zeta^s_{1:T}} \Phi[\zeta^s_{1:T}, z] = \sum_{s} \alpha_T(s) \tag{139} $$

We now derive a recursive relation for the forward. Assume that the segment ending
at j starts somewhere at $i \in [1, j]$. Then for i > 1, there exists the decomposition
$\zeta^s_{1:j} = (\zeta^{s'}_{1:i-1}, x_{i:j} = s, e_{i:j-1} = 0)$ for some $s'$, which leads to the following factorisation

$$ \Phi[\zeta^s_{1:j}, z] = \Phi[\zeta^{s'}_{1:i-1}, z]\, A_{s',s,i-1}\, R^s_{i:j} \tag{140} $$

The transition potential $A_{s',s,i-1}$ occurs in the context $c = (e_{i-1} = 1)$, and the
segmental potential $R^s_{i:j}$ in the context $c = (x_{i:j} = s, e_{i-1} = 1, e_{i:j-1} = 0)$.

For i = 1, the factorisation reduces to $\Phi[\zeta^s_{1:j}, z] = R^s_{1:j}$. Since we do not know the
starting i, we must consider all possible values in the interval [1, j]. Thus, Equation 138
can be rewritten as

$$ \alpha_j(s) = \sum_{i \in [2,j]} \sum_{s'} \sum_{\zeta^{s'}_{1:i-1}} \Phi[\zeta^{s'}_{1:i-1}, z]\, A_{s',s,i-1}\, R^s_{i:j} + R^s_{1:j} \tag{141} $$

$$ = \sum_{i \in [2,j]} \sum_{s'} \alpha_{i-1}(s')\, A_{s',s,i-1}\, R^s_{i:j} + R^s_{1:j} \tag{142} $$

Backward

The backward is the 'mirrored' version of the forward. In particular, let
$\zeta^s_{j:T} = (x_{j+1:T}, e_{j:T}, x_j = s, e_{j-1} = 1)$, and we define the backward $\beta_j(s)$ as

$$ \beta_j(s) = \sum_{\zeta^s_{j:T}} \Phi[\zeta^s_{j:T}, z] \tag{143} $$

Clearly, the partition function can be written in terms of the backward as

$$ Z(z) = \sum_{s} \beta_1(s) \tag{144} $$

The recursive relation for the backward is

$$ \beta_i(s) = \sum_{j \in [i,T-1]} \sum_{s'} R^s_{i:j}\, A_{s,s',j}\, \beta_{j+1}(s') + R^s_{i:T} \tag{145} $$

Typically, we want to limit the segment to the maximum length of $L \in [1, T]$. This
limitation introduces some special cases when performing recursive computation of
the forward and backward. Equations 141 and 145 are rewritten as follows

$$ \alpha_j(s) = \sum_{i \in [j-L+1,j],\, i>1} \sum_{s'} \alpha_{i-1}(s')\, A_{s',s,i-1}\, R^s_{i:j} + R^s_{1:j} \tag{146} $$

$$ \beta_i(s) = \sum_{j \in [i,i+L-1],\, j<T} \sum_{s'} R^s_{i:j}\, A_{s,s',j}\, \beta_{j+1}(s') + R^s_{i:T} \tag{147} $$

Since it is a bit clumsy to represent a SemiCRF as a three-level HSCRF, we can 
extend the HSCRF straightforwardly by allowing the bottom level states to persist. 
With this relaxation we have a nested SemiCRF model in the sense that each segment 
in a Markov chain is also a Markov chain of sub-segments. 
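
The forward recursion with a maximum segment length can be sketched as follows. This is a minimal illustration, not the authors' implementation: the potentials are passed as hypothetical callables `R(s, i, j)` and `A(s_prev, s, t)`, times are 1-based as in the text, and the sketch includes the initial term $R^s_{1:j}$ only when the segment [1, j] is no longer than L, which is one reading of the special cases mentioned above. A brute-force enumeration over all segmentations is included as a check.

```python
def semicrf_partition(R, A, n_states, T, L):
    """Z(z) from the forward recursion; times are 1-based as in the text.

    R(s, i, j): segment potential of state s over [i, j] (hypothetical callable).
    A(s_prev, s, t): transition potential at segment boundary t (hypothetical callable).
    """
    alpha = [[0.0] * n_states for _ in range(T + 1)]  # alpha[j][s]; row 0 is unused
    for j in range(1, T + 1):
        for s in range(n_states):
            # A segment starting at time 1 is only allowed if it fits within L.
            total = R(s, 1, j) if j <= L else 0.0
            for i in range(max(2, j - L + 1), j + 1):
                total += R(s, i, j) * sum(
                    alpha[i - 1][r] * A(r, s, i - 1) for r in range(n_states)
                )
            alpha[j][s] = total
    return sum(alpha[T])

def semicrf_partition_brute(R, A, n_states, T, L):
    """Sum the potential of every segmentation explicitly (exponential; checking only)."""
    def extend(start, prev_state):
        if start > T:
            return 1.0
        total = 0.0
        for end in range(start, min(T, start + L - 1) + 1):
            for s in range(n_states):
                w = R(s, start, end)
                if prev_state is not None:
                    w *= A(prev_state, s, start - 1)
                total += w * extend(end + 1, s)
        return total
    return extend(1, None)
```

The forward version needs a number of operations proportional to $T L |S|^2$, whereas the brute-force check grows exponentially in T and is useful only for validating the recursion on short sequences.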

C.2 Partially Supervised Learning and Constrained Inference 

Following the intuition in Section 7.1, we require that all the forward and backward
quantities and the potentials $R^s_{i:j}$ used in Equations 146 and 147 must be consistent
with the labels in the case of partial supervision and constrained inference.

Specifically, any quantities that are not consistent are set to zero. Let the labels be
$\vartheta = (x, e)$. Then the potential $R^s_{i:j}$ is consistent if it satisfies the following
requirements:

• if there are any labeled states in the interval [i, j], they must be s,

• if there is any labeled ending indicator $e_{i-1}$, then $e_{i-1} = 1$,

• if there is any labeled ending indicator $e_k$ for some $k \in [i, j-1]$, then $e_k = 0$,
and

• if any ending indicator $e_j$ is labeled, then $e_j = 1$.

These conditions are captured by using the following identity function:

$$ \mathbb{I}[R^s_{i:j}] = \delta[x_{k \in [i,j]} = s]\, \delta[e_{i-1} = 1]\, \delta[e_{k \in [i:j-1]} = 0]\, \delta[e_j = 1] \tag{148} $$

Notice how these conditions and this equation resemble those in Equation 103. This is
because a SemiCRF is just a simplified version of an HSCRF where the potential $R^s_{i:j}$
plays the role of the inside mass $\Delta^{d,s}_{i:j}$.

Similarly, the forward $\alpha_j(s)$ is consistent if the following conditions are satisfied:

• if there is a labeled ending indicator at j, then $e_j = 1$, and

• if there is a labeled state at j, then $x_j = s$.

The consistency is captured in the following identity function:

$$ \mathbb{I}[\alpha_j(s)] = \delta[e_j = 1]\, \delta[x_j = s] \tag{149} $$

Furthermore, the backward $\beta_i(s)$ is consistent where:

• if there is a labeled ending indicator at i − 1, then $e_{i-1} = 1$, and

• if there is a labeled state at i, then $x_i = s$.

And again, we have the following identity function

$$ \mathbb{I}[\beta_i(s)] = \delta[e_{i-1} = 1]\, \delta[x_i = s] \tag{150} $$

By installing the consistency identity functions in Equations 148, 149 and 150 into
Equations 146 and 147 we now arrive at

$$ \alpha_j(s) = \mathbb{I}[\alpha_j(s)] \Big( \sum_{i \in [j-L+1,j],\, i>1} \sum_{s'} \alpha_{i-1}(s')\, A_{s',s,i-1}\, \mathbb{I}[R^s_{i:j}]\, R^s_{i:j} + \mathbb{I}[R^s_{1:j}]\, R^s_{1:j} \Big) \tag{151} $$

$$ \beta_i(s) = \mathbb{I}[\beta_i(s)] \Big( \sum_{j \in [i,i+L-1],\, j<T} \sum_{s'} \mathbb{I}[R^s_{i:j}]\, R^s_{i:j}\, A_{s,s',j}\, \beta_{j+1}(s') + \mathbb{I}[R^s_{i:T}]\, R^s_{i:T} \Big) \tag{152} $$
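
The identity function of Equation 148 translates directly into a small check. In the sketch below the partial labels are stored in two dictionaries that contain only the labeled positions, which is a hypothetical encoding chosen for illustration; an unlabeled position never violates a condition.

```python
def segment_consistent(s, i, j, state_labels, end_labels):
    """Indicator of Equation 148 for a segment of state s over [i, j] (1-based, inclusive).

    state_labels: {t: state} and end_labels: {t: 0 or 1} hold only the labeled
    positions; unlabeled positions are simply absent (hypothetical encoding).
    """
    # Any labeled state inside the segment must be s.
    if any(state_labels.get(k, s) != s for k in range(i, j + 1)):
        return 0.0
    # If labeled, the ending indicator just before the segment must be 1.
    if end_labels.get(i - 1, 1) != 1:
        return 0.0
    # No labeled ending is allowed strictly inside the segment.
    if any(end_labels.get(k, 0) != 0 for k in range(i, j)):
        return 0.0
    # If labeled, the ending indicator at j must be 1.
    if end_labels.get(j, 1) != 1:
        return 0.0
    return 1.0

# Hypothetical partial labels: state 1 at time 3, an ending at time 2, no ending at time 4.
state_labels = {3: 1}
end_labels = {2: 1, 4: 0}
```

Multiplying each segment potential by this indicator before running the forward and backward recursions gives the constrained quantities of Equations 151 and 152, apart from the analogous indicators on the forward and backward themselves.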

C.3 Numerical Scaling 

We have already shown that a SemiCRF is indeed a 3-level HSCRF where the top and
the bottom levels are dummy states, that is, the state size is one and all the potentials
associated with them have a value of one. To apply the scaling method described in
Section 8 we notice that

• $\alpha_t(s)$ plays the role of the asymmetric inside mass $\alpha^{d,s}_{i:j}(u)$

• $\beta_t(s)$ plays the role of the asymmetric outside mass $\lambda^{d,s}_{i:j}(u)$

What we do not have here is the explicit notion of the inside mass $\Delta^{d,s}_{i:j}$, but it can be
considered as having a value of one. So to apply the scaling algorithm in Figure 19
we may scale the state-persistence potential $R^s_{i:j}$ instead. The simplified version of
Figure 19 is given in Figure 25.

Of course, the partial scaling step can itself be a source of numerical overflow through
the product of scaling factors $\prod_k \kappa_k$. The trick here is to realise that
$b / \prod_k a_k = \exp(\log b - \sum_k \log a_k)$ so that we never compute $b / \prod_k a_k$ directly
but the equivalent $\exp(\log b - \sum_k \log a_k)$.
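
The log-domain trick can be illustrated in a few lines. The numbers below are hypothetical and chosen so that the product of the factors overflows in double precision even though the final ratio is perfectly representable.

```python
import math

def scaled_ratio(b, factors):
    """Compute b / prod(factors) as exp(log b - sum log a_k), for positive inputs."""
    return math.exp(math.log(b) - sum(math.log(a) for a in factors))

factors = [1e200, 1e200]
naive = 1e300 / (factors[0] * factors[1])   # the product overflows to infinity first
stable = scaled_ratio(1e300, factors)       # stays finite and close to 1e-100
```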

Notation                          Description

$x^{d:d'}_{i:j}$                  Subset of state variables from level d down to level d', starting from time i and ending at time j, inclusive.
$e^{d:d'}_{i:j}$                  Subset of ending indicators from level d down to level d', starting from time i and ending at time j, inclusive.
$\zeta^{d,s}_{i:j}$               Set of state variables and ending indicators of a sub-model rooted at $s^d$, level d, spanning a sub-string [i, j]
$c$                               Contextual clique
$i, j, t$                         Time indices
$\tau^d$                          Set of all ending time indices, e.g. if $i \in \tau^d$ then $e^d_i = 1$
$r, s, u, v, w$                   State
$R^{d,s}_{i:j}$                   State-persistence potential of state s, level d, spanning [i, j]
$\pi^{d,s}_{u,i}$                 Initialisation potential of state s at level d, time i, initialising sub-state u
$A^{d,s}_{u,v,i}$                 Transition at level d, time i, from state u to v under the same parent s
$E^{d,s}_{u,i}$                   Ending potential of state s at level d and time i, receiving the return control from the child u
$\Phi[\zeta, z]$                  The global potential of a particular configuration $\zeta$
$|S^d|$                           The number of state symbols at level d
$\Delta^{d,s}_{i:j}$              The symmetric inside mass for a state s at level d, spanning a substring [i, j]
$\hat{\Delta}^{d,s}_{i:j}$        The full symmetric inside mass for a state s at level d, spanning a substring [i, j]
$\Lambda^{d,s}_{i:j}$             The symmetric outside mass for a state s at level d, spanning a substring [i, j]
$\hat{\Lambda}^{d,s}_{i:j}$       The full symmetric outside mass for a state s at level d, spanning a substring [i, j]
$\alpha^{d,s}_{i:j}(u)$           The asymmetric inside mass for a parent state s at level d, starting at i and having a child-state u which returns control to the parent or transits to a new child-state at j
$\lambda^{d,s}_{i:j}(u)$          The asymmetric outside mass, as a counterpart of the asymmetric inside mass $\alpha^{d,s}_{i:j}(u)$
                                  Potential functions.

Table 1: Notations used in this paper.

Notation                              Description

$\Delta^{max,d,s}_{i:j}$              The optimal potential function of the subset of variables $\zeta^{d,s}_{i:j}$
$\hat{\Delta}^{max,d,s}_{i:j}$        The full version of $\Delta^{max,d,s}_{i:j}$
$\alpha^{max,d,s}_{i:j}(u)$           The optimal potential function of the subset of variables $\zeta^{d,s}_{i:j}(u)$
$\Delta^{arg,d,s}_{i:j}$              The optimal child $u^{d+1}$ of s
$\alpha^{arg,d,s}_{i:j}(u)$           The optimal child that transits to $u^{d+1}$ and the time index t
$\mathcal{I}^d$                       The set of optimal 'segments' at each level d

Table 2: Notations used in this section.



Equation    Log-space form

Eq. (59)    $\log \Delta^{\max,d,s}_{i:j} = \max_{u \in S^{d+1}} \{ \log E^{d,s}_{u,j} + \log \alpha^{\max,d,s}_{i:j}(u) \}$

Eq. (61)    $\log \alpha^{\max,d,s}_{i:j}(u) = \max \{ \max_{t \in [i+1,j]} \max_{v \in S^{d+1}} \{ \log \alpha^{\max,d,s}_{i:t-1}(v) + \log \hat{\Delta}^{\max,d+1,u}_{t:j} + \log A^{d+1,s}_{v,u,t-1} \};\ \log \hat{\Delta}^{\max,d+1,u}_{i:j} + \log \pi^{d+1,s}_{u,i} \}$

Eq. (62)    $\log \alpha^{\max,D-1,s}_{i:j}(u) = \max_{v \in S^{D}} \{ \log \alpha^{\max,D-1,s}_{i:j-1}(v) + \log \hat{\Delta}^{\max,D,u}_{j:j} + \log A^{D,s}_{v,u,j-1} \}$

Eq. (63)    $\log \alpha^{\max,d,s}_{i:i}(u) = \log \hat{\Delta}^{\max,d+1,u}_{i:i} + \log \pi^{d+1,s}_{u,i}$

Table 3: MAP equations in the log-space.
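As a minimal illustration of why the MAP equations are evaluated in the log-space, the sketch below computes the Eq. (59) row for one fixed parent state and one segment. The function name `log_delta_max`, the list-based layout and the toy values are assumptions made for illustration only; they do not come from the paper.

```python
import math

def log_delta_max(log_E, log_alpha_max):
    """Eq. (59) in log-space for one parent state s and one segment [i, j].

    log_E[u]         : log of the ending potential for child u (toy values)
    log_alpha_max[u] : log of the optimal asymmetric potential for child u
    Returns the maximum and the maximising child index, which is the kind
    of entry a bookkeeper would store.
    """
    scores = [e + a for e, a in zip(log_E, log_alpha_max)]
    best_u = max(range(len(scores)), key=scores.__getitem__)
    return scores[best_u], best_u

# Toy potentials that are small enough for their product to underflow.
E = [1e-200, 1e-180]
alpha_max = [1e-150, 1e-190]

# Linear-space max-product: both products underflow to zero.
linear = max(e * a for e, a in zip(E, alpha_max))

# Log-space max-sum: the comparison between children is still meaningful.
value, child = log_delta_max([math.log(e) for e in E],
                             [math.log(a) for a in alpha_max])
```

In linear space the two candidate children cannot be distinguished once the products underflow, whereas the log-space sums remain finite and comparable.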



Alg.        d = 2    d = 3          Alg.        d = 2    d = 3
HSCRF       100      93.9           PO-HSCRF    80.2     90.4
DCRF        96.5     89.7           PO-CRF      -        83.5
flat-CRF    -        82.6

Table 4: Accuracy (%) for fully observed data (left), and partially observed (PO) data (right).






Input: $D$, $T$, all the potential function values.
Output: the bookkeepers:
    $\Delta^{\arg,1,s}_{i:j}$, for $s \in S^1$ and $1 \le i \le j \le T$;
    $\Delta^{\arg,d,s}_{i:j}$, for $d \in [2, D-1]$, $s \in S^d$;
    $\Delta^{\arg,D,s}_{i:i}$, for $s \in S^D$ and $i \in [1, T]$;
    $\alpha^{\arg,d,s}_{i:j}(u)$, for $d \in [1, D-1]$, $u \in S^{d+1}$ and $1 \le i \le j \le T$.

/* Initialisation */
$\Delta^{\max,D,s}_{i:i} = 1$ for all $i \in [1, T]$ and $s \in S^D$

/* At the level d = D-1 */
For i = 1, 2, ..., T
    For j = i, i+1, ..., T
        Compute $\alpha^{\max,D-1,s}_{i:j}(u)$ using Equation (62) and
            $\alpha^{\arg,D-1,s}_{i:j}(u)$ using Equation (67)
        Compute $\Delta^{\max,D-1,s}_{i:j}$ using Equation (59) and
            $\Delta^{\arg,D-1,s}_{i:j}$ using Equation (64)
    EndFor
EndFor

/* The main recursion loops: bottom-up and forward */
For d = D-2, D-3, ..., 1
    For i = 1, 2, ..., T
        For j = i, i+1, ..., T
            If j = i
                Compute $\alpha^{\max,d,s}_{i:j}(u)$ using Equation (63)
            Else
                Compute $\alpha^{\max,d,s}_{i:j}(u)$ using Equation (61) and
                    $\alpha^{\arg,d,s}_{i:j}(u)$ using Equation (65)
            EndIf
            If d > 1
                Compute $\Delta^{\max,d,s}_{i:j}$ using Equation (59) and
                    $\Delta^{\arg,d,s}_{i:j}$ using Equation (64)
            EndIf
        EndFor
    EndFor
EndFor

Compute $\Delta^{\max,1,s}_{1:T}$ using Equation (59) and
    $\Delta^{\arg,1,s}_{1:T}$ using Equation (64)

Figure 16: Computing the bookkeepers.
Input: $D$, $T$, all the filled bookkeepers.
Output: the optimal assignment $\zeta^{MAP}$

$s^* = \arg\max_{s \in S^1} \Delta^{\max,1,s}_{1:T}$
Initialise triple buckets $\mathcal{I}^1 = \{(s^*, 1, T)\}$ and $\mathcal{I}^d = \{\}$ for $d \in [2, D]$
For d = 1, 2, ..., D-1
    For each triple $(s^*, i, j)$ in $\mathcal{I}^d$
        Let $u^* = \Delta^{\arg,d,s^*}_{i:j}$
        For $i \le j$
            If $\alpha^{\arg,d,s^*}_{i:j}(u^*)$ is defined Then
                $(v^*, t^*) = \alpha^{\arg,d,s^*}_{i:j}(u^*)$
                Add the triple $(u^*, t^*, j)$ to $\mathcal{I}^{d+1}$ and Set $j = t^* - 1$ and $u^* = v^*$
            Else
                Add the triple $(u^*, i, j)$ to $\mathcal{I}^{d+1}$ and Break this loop
            EndIf
        EndFor
    EndFor
EndFor

For each stored triple $(s^*, i, j)$ in the bucket $\mathcal{I}^d$, for $d \in [1, D]$,
    create a corresponding set of variables $(x^d_{i:j} = s^*, e^d_{i-1} = 1, e^d_j = 1, e^d_{i:j-1} = 0)$.
The joining of these sets is the optimal assignment $\zeta^{MAP}$.

Figure 17: Backtracking for optimal assignment (nested Markov blankets).



Input: $D$, $T$, all the potential function values.
Output: the optimal assignment $\zeta^{MAP}$

Run the bottom-up discrete optimisation procedure described in Figure 16.
Run the top-down backtracking procedure described in Figure 17.

Figure 18: The generalised Viterbi algorithm.
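The two-pass structure of Figure 18 (fill the bookkeepers bottom-up, then backtrack top-down) can be sketched on a much simpler model. The code below is a flat, single-level semi-Markov simplification, not the full hierarchical recursion; the function name `semi_markov_viterbi`, the callable `log_seg`, the matrix `log_trans` and the convention that the first segment has no incoming transition are all assumptions made for illustration.

```python
def semi_markov_viterbi(T, n_states, log_seg, log_trans):
    """Two-pass MAP for a flat semi-Markov chain (simplified analogue of Figure 18).

    log_seg(s, i, j) : log state-persistence potential of label s on [i, j],
                       with 1-based inclusive indices (hypothetical callable)
    log_trans[a][b]  : log transition potential from label a to label b
    Returns (best log-score, list of (label, start, end) segments).
    """
    neg_inf = float("-inf")
    # best[j][s]: best score of a segmentation of [1, j] whose last label is s
    best = [[neg_inf] * n_states for _ in range(T + 1)]
    # back[j][s]: bookkeeper holding (previous label, start of last segment)
    back = [[None] * n_states for _ in range(T + 1)]

    # Pass 1: forward optimisation, filling the bookkeepers.
    for j in range(1, T + 1):
        for s in range(n_states):
            for i in range(1, j + 1):
                seg = log_seg(s, i, j)
                if i == 1:
                    # First segment: no predecessor and no transition term.
                    if seg > best[j][s]:
                        best[j][s], back[j][s] = seg, (None, 1)
                    continue
                for a in range(n_states):
                    cand = best[i - 1][a] + log_trans[a][s] + seg
                    if cand > best[j][s]:
                        best[j][s], back[j][s] = cand, (a, i)

    # Pass 2: backtracking from the best final label.
    s = max(range(n_states), key=lambda k: best[T][k])
    score, j, segments = best[T][s], T, []
    while j >= 1:
        prev, i = back[j][s]
        segments.append((s, i, j))
        s, j = prev, i - 1
    return score, segments[::-1]
```

The returned list of `(label, start, end)` triples plays a role similar to the triple buckets in the backtracking procedure, restricted here to one level.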
Input: $D$, $T$ and all the contextual potentials.
Output: Scaled quantities: inside/asymmetric inside masses,
    outside/asymmetric outside masses.

For j = 1, 2, ..., T
    Compute $\alpha^{d,s}_{1:j}(u)$, $d \in [1, D-1]$, using Equations (40), (41) and (42)
    Compute $\kappa_j$ using Equation (115)
    Rescale $\alpha^{1,s}_{1:j}(u)$ using Equation (116)
    For i = 1, 2, ..., j
        For d = 2, 3, ..., D-1
            Rescale $\alpha^{d,s}_{i:j}(u)$ using Equation (116)
            Rescale $\Delta^{d,s}_{i:j}$ using Equation (117)
        EndFor
    EndFor
    Rescale $\hat{\Delta}^{D,s}_{j:j}$ using Equation (118)
EndFor

Compute true log-partition function using Equation (114)
Compute the outside/asymmetric outside masses using the
    scaled inside/asymmetric inside masses instead of the original
    inside/asymmetric inside masses in Equations (47) and (51)

Figure 19: Scaling algorithm to avoid numerical overflow.




Figure 20: The topology learned from data.

Figure 21: The state transition model learned from data. Primitive states are duplicated
for clarity only. They are shared among complex states.

Figure 22: Performance of the constrained max-product algorithm as a function of
available information on label/start/end time.

Figure 23: Performance of various models on CoNLL-2000 noun-phrase chunking (x-axis: number of training sentences).
HSCRF+POS and DCRF+POS mean HSCRF and DCRF with POS given at test time,
respectively.



Figure 24: The SemiCRFs in our contextual clique framework (labels in the figure: state persistence, state transition).

Input: $T$, the transition potentials and the state-persistence potentials.
Output: Scaled quantities: state-persistence potentials, forward/backward.

For j = 1, 2, ..., T
    /* Partial scaling */
    For i = j-L+1, ..., j-1
        Rescale $R^{s}_{i:j} \leftarrow R^{s}_{i:j} / \prod_{k=i}^{j-1} \kappa_k$
    EndFor
    Compute $\alpha_j(s)$ using Equation (138)
    Compute $\kappa_j = \sum_s \alpha_j(s)$
    /* Full scaling */
    Rescale $\alpha_j(s) \leftarrow \alpha_j(s) / \kappa_j$
    For i = j-L+1, ..., j
        Rescale $R^{s}_{i:j} \leftarrow R^{s}_{i:j} / \kappa_j$
    EndFor
EndFor

Compute true log-partition function using Equation (114)
Compute the backward/ESSes using the scaled potentials.

Figure 25: Scaling SemiCRF.
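A minimal sketch of the scaled forward pass for a semi-Markov chain follows. It assumes the standard semi-Markov forward recursion with maximum segment length `L`, assumes that the first segment carries no incoming transition, and assumes that the log-partition function is recovered as the sum of the logarithms of the scaling factors. The function name `scaled_semicrf_forward` and the dictionary layout of `R` are hypothetical, and the code is not taken from the paper.

```python
import math

def scaled_semicrf_forward(T, L, R, A):
    """Forward pass of a semi-Markov chain with per-step scaling.

    R[s][(i, j)] : state-persistence potential of label s on segment [i, j]
                   (1-based, inclusive, length at most L)
    A[a][b]      : transition potential from label a to label b
    Returns (log-partition function, [kappa_1, ..., kappa_T]).
    """
    n = len(A)
    alpha = [[0.0] * n for _ in range(T + 1)]   # alpha[0] is an unused placeholder
    kappa = [1.0] * (T + 1)                     # kappa[0] is an unused placeholder
    for j in range(1, T + 1):
        new = [0.0] * n
        for i in range(max(1, j - L + 1), j + 1):
            # Partial scaling: divide by kappa_i * ... * kappa_{j-1}
            # (an empty product, equal to 1, when i == j).
            scale = math.prod(kappa[i:j])
            for s in range(n):
                r = R[s][(i, j)] / scale
                if i == 1:
                    new[s] += r                 # first segment, no transition
                else:
                    new[s] += r * sum(alpha[i - 1][a] * A[a][s] for a in range(n))
        kappa[j] = sum(new)                     # kappa_j = sum_s alpha_j(s)
        alpha[j] = [v / kappa[j] for v in new]  # full scaling of the forward values
    return sum(math.log(k) for k in kappa[1:]), kappa[1:]
```

Because each scaled forward vector sums to one, the intermediate values stay in a safe numerical range even when the unscaled forward values would underflow or overflow for long sequences.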